Abstract

In many situations, sample data is obtained from a noisy or imperfect source. In order 
to address such corruptions, this paper introduces the concept of a sampling corrector. Such 
algorithms use structure that the distribution is purported to have, in order to allow one to make 
“on-the-fly” corrections to samples drawn from probability distributions. These algorithms then 
act as filters between the noisy data and the end user. 

We show connections between sampling correctors, distribution learning algorithms, and 
distribution property testing algorithms. We show that these connections can be utilized to 
expand the applicability of known distribution learning and property testing algorithms as well 
as to achieve improved algorithms for those tasks. 

As a first step, we show how to design sampling correctors using proper learning algorithms. 
We then focus on the question of whether algorithms for sampling correctors can be more 
efficient in terms of sample complexity than learning algorithms for the analogous families of 
distributions. When correcting monotonicity, we show that this is indeed the case when also 
granted query access to the cumulative distribution function. We also obtain sampling correctors 
for monotonicity without this stronger type of access, provided that the distribution be originally 
very close to monotone (namely, at a distance O(1/log² n)). In addition to that, we consider
a restricted error model that aims at capturing “missing data” corruptions. In this model, we 
show that distributions that are close to monotone have sampling correctors that are significantly 
more efficient than achievable by the learning approach. 

We consider the question of whether an additional source of independent random bits is 
required by sampling correctors to implement the correction process. We show that for correcting
close-to-uniform distributions and close-to-monotone distributions, no additional source of
random bits is required, as the samples from the input source itself can be used to produce this 
randomness. 
1 Introduction


Data consisting of samples from distributions is notorious for reliability issues: Sample data can 
be greatly affected by noise, calibration problems or other faults in the sample recording process; 
portions of data may be lost; extraneous samples may be erroneously recorded. Such noise may 
be completely random, or may have some underlying structure. To give a sense of the range of 
difficulties one might have with sample data, we mention some examples: A sensor network which 
tracks traffic data may have dead sensors which transmit no data at all, or other sensors that are 
defective and transmit arbitrary numbers. Sample data from surveys may suffer from response rates 
that are correlated with location or socioeconomic factors. Sample data from species distribution 
models are prone to geographic location errors [HBTB14]. 

Statisticians have grappled with defining a methodology for working with distributions in the 
presence of noise by correcting the samples. If, for example, you know that the uncorrupted 
distribution is Gaussian, then it would be natural to correct the samples of the distribution to the 
nearest Gaussian. The challenge in defining this methodology is: how do you correct the samples 
if you do not know much about the original uncorrupted distribution? To analyze distributions 
with noise in a principled way, approaches have included imputation [LR02, Sch97, STP07] for the 
case of missing or incomplete data, and outlier detection and removal [Haw80, Bar78, IH93] to 
handle “extreme points” deviating significantly from the underlying distribution. More generally, 
the question of coping with the sampling bias inherent to many strategies (such as opportunity 
sampling) used in studying rare events or species, or with inaccuracies in the reported data, is a 
key challenge in many of the natural and social sciences (see e.g. [SL82, SV03, PMG08]). While 
these problems are usually dealt with on a case-by-case basis, drawing on additional knowledge or 
modeling assumptions, no general procedure is known that addresses them in a systematic fashion. 

In this work, we propose a methodology which is based on using known structural properties 
of the distribution to design sampling correctors which “correct” the sample data. Examples of
structural properties which might be used to correct samples include the property of being bimodal, 
a mixture of several Gaussians, or an independent joint distribution. Within this methodology, the 
main question is how best can one output samples of a distribution in which on one hand, the 
structural properties are restored, and on the other hand, the corrected distribution is close to the 
original distribution? We show that this task is intimately connected to distribution learning tasks, 
but we also give instances in which such tasks can be performed strictly more efficiently. 

1.1 Our model 

We introduce two (related) notions of algorithms to correct distributions: sampling correctors and 
sampling improvers. Although the precise definitions are deferred to Section 3, we describe and 
state informally what we mean by these. In what follows, Ω is a finite domain, V is any fixed
property of distributions over Ω (i.e., a subset of the distributions over Ω), and distances between
distributions are measured according to their total variation distance.¹

A sampling corrector for V is a randomized algorithm which gets samples from a distribution D
guaranteed to be ε-close to having property V, and outputs a sample from a “corrected distribution”
D̃ which, with high probability, (a) has the property; and (b) is still close to the original distribution
D (i.e., within distance ε₁). The sample complexity of such a corrector is the number of samples
it needs to obtain from D in order to output one from D̃. In some settings it may be too much
to ask for complete correction (or it may not even be the most desirable option). For this reason, we
also consider the weaker notion of sampling improvers: an improver is similar to a sampling corrector,
but is only required to transform the distribution into a new distribution which is closer to having
the property V.

¹The total variation distance is defined as $d_{\mathrm{TV}}(D_1, D_2) = \max_{S \subseteq \Omega} (D_1(S) - D_2(S)) = \frac{1}{2} \sum_{x \in \Omega} |D_1(x) - D_2(x)|$.

One naive way to solve these problems, the “learning approach,” is to approximate the probability
mass function of D, and find a candidate D̃ ∈ V. Since we assume we have a complete
description of D̃, we can then output samples according to D̃ without further access to D. In
general, such an approach can be very inefficient in terms of time complexity. However, if there
is an efficient agnostic proper learning algorithm² for V, we show that this approach can lead to
efficient sampling correctors. For example, we use such an approach to give sampling correctors for 
the class of monotone distributions. 

In our model, we wish to optimize the following two parameters of our correcting algorithms: 
The first parameter is the number of samples of D needed to output samples of D̃. The second
parameter is the number of additional truly random bits needed for outputting samples of D̃. Note
that in the above learning approach, the dependence on each of these parameters could be quite 
large. Although these parameters are not independent of each other (if D is of high enough entropy, 
then it can be used to simulate truly random bits), they can be thought of as complementary, as 
one typically will aim at a tradeoff between the two. Furthermore, a parsimonious use of extra 
random bits may be crucial for some applications, while in others the correction of the data itself 
is the key factor; for this reason, we track each of the parameters separately. For any property V, 
the main question is whether one can achieve improved complexity in terms of these parameters 
over the use of the naive (agnostic) learning approach for V. 

1.2 Our results 

Throughout this paper, we will focus on two particular properties of interest, namely uniformity 
and monotonicity. The first one, arguably one of the most natural and illustrative properties to 
be considered, is nonetheless deeply challenging in the setting of randomness scarcity. As for the 
second, not only does it provide insight into the workings of sampling correctors as well as non-trivial
connections and algorithmic results, but it is also one of the most-studied classes of distributions
in the statistics and probability literature, with a body of work covering several decades (see
e.g. [Gre56, Bir87, BKR04, DDS14], or [DDS+13] for a detailed list of references). Moreover, recent
work on distribution testing [DDS+13, CDGR15] shows strong connections between monotonicity
and a wide range of other properties, such as for instance log-concavity, Monotone Hazard Risk 
and Poisson Binomial Distributions. This gives evidence that the study of monotone distributions 
may have direct implications for correction of many of these “shape-constrained properties.” 

²Recall that a learning algorithm for a class of distributions C is an algorithm which gets independent samples from
an unknown distribution D ∈ C, and on input ε must, with high probability, output a hypothesis which is ε-close to
D in total variation distance. If the hypotheses the algorithm produces are guaranteed to belong to C as well, we call
it a proper learning algorithm. Finally, if the (not necessarily proper) algorithm is able to learn distributions that
are only close to C, returning a hypothesis at a distance at most OPT + ε from D, where OPT is the distance from D
to the class, it is said to be agnostic. For a formal definition of these concepts, the reader is referred to Section 4.2
and Appendix B. 
Sampling correctors, learning algorithms and property testing algorithms. We begin
by showing implications of the existence of sampling correctors and the existence of various types 
of learning and property testing algorithms in other models. We first show in Theorem 4.1 that 
learning algorithms for a distribution class imply sampling correctors for distributions in this class 
(under any property to correct) with the same sample complexity, though not necessarily the 
same running time dependency. However, when efficient agnostic proper learning algorithms for 
a distribution class exist, we show that there are efficient sampling correctors for the same class. 
In [Bir87, CDSS14] efficient algorithms for agnostic learning of concise representations for several 
families of distributions are given, including distributions that are monotone, k-histograms, Poisson
binomial, and sums of k independent random variables. Not all of these algorithms are proper. 

Next, we show in Theorem 4.4 that the existence of (a) an efficient learning algorithm, as 
e.g. in [ILR12, CDSS13, DDS12, DDO+13], and (b) an efficient sampling corrector for a class of
distributions implies an efficient agnostic learning algorithm for the same class of distributions. It 
is well known that agnostic learning can be much harder than non-agnostic learning, as in the latter 
the algorithm is able to leverage structural properties of the class C. 

Our third result in this section, Theorem 4.7, shows that an efficient property tester, an efficient 
distance estimator (which computes an additive estimate of the distance between two distributions) 
and an efficient sampling corrector for a distribution class imply a tolerant property tester with 
complexity equal to the complexity of correcting the number of samples required to run both the 
tester and estimator.³ As tolerant property testing can be much more difficult than property testing
[GR00, BFR+10, Pan08, VV11], this gives a general purpose way of getting both upper bounds on
tolerant property testing and lower bounds on sampling correctors. 

We describe how these results can be employed in Section 4, where we give specific applications in 
achieving improved property testers for various properties. 

Is sampling correction easier than learning? We next turn to the question of whether there 
are natural examples of sampling correctors whose query complexity is lower than that of distribution
learning algorithms for the same class. While the sample complexity of learning monotone
distributions is known to be Ω(log n) [Bir87] (this lower bound on the sample and query complexity
holds even when the algorithm is allowed both to make queries to the cumulative distribution
function as well as to access samples of the distribution), we present in Section 5.2 an oblivious
sampling corrector for monotone distributions whose sample complexity is O(1) and that corrects
any error ε ≤ O(1/log² n). This is done by first implicitly approximating the
distribution by a “histogram” on only a small number of intervals, using ingredients from [Bir87]. 
This (very close) approximation can then be combined, still in an oblivious way, with a carefully 
chosen slowly decreasing distribution, so that the resulting mixture is not only guaranteed to be 
monotone, but also close to the original distribution. 

Assuming a stronger type of access to the unknown distribution, namely query access to its
cumulative distribution function (cdf) as in [BDKR05, CR14], we describe in Section 5.3 a sampling
corrector for monotonicity with (expected) query complexity O(√(log n)) which works for arbitrary
ε ∈ (0,1). At a high level, our algorithm combines the “succinct histogram” technique mentioned
above with a two-level bucketing approach to correct the distribution first at a very coarse level only
(on “superbuckets”), and defer the finer corrections (within a given superbucket) to be made
on-the-go at query time. The challenge in this last part is that one must ensure that all of these disjoint
local corrections are consistent with each other and, crucially, with all future sample corrections.
To achieve this, we use a “boundary correction” subroutine which fixes potential violations between
two neighboring superbuckets by evening out the boundary differences. To make it possible, we use
rejection sampling to allocate adaptively an extra “budget” to each superbucket that this subroutine
can use for corrections.

³Recall that the difference between testing and tolerant testing lies in that the former asks to distinguish whether
an unknown distribution has a property, or is far from it, while the latter requires to decide whether the distribution
is close to the property versus far from it. (See Appendix B for the rigorous definition.)

It is open whether using only samples, i.e., without the more powerful cdf queries, there exist 
sampling correctors for monotone distributions with sample complexity o(log n) that can correct a 
constant error ε ∈ (0,1).

Restricted error models. Since many of the sampling correction problems are difficult to solve 
in general, we suggest error models for which more efficient sampling correction algorithms may 
be available. A first class of error models, which we refer to as missing data errors, is introduced 
in Section 6 and defined as follows: given a distribution over [n], all samples in some interval [i,j]
for 1 ≤ i < j ≤ n are deleted. Such errors could correspond to samples from a sensor network
where one of the sensors ran out of power; emails mistakenly deleted by a spam filter; or samples 
from a study in which some of the paperwork got lost. Whenever the input distribution D, whose 
distance from monotonicity is ε ∈ (0,1), falls under this model, we give a sampling improver
that is able to find a distribution both ε₂-close to monotone and O(ε)-close to the original using
Õ(1/ε₂³) samples. The improver works in two stages. In the “preprocessing stage,” we detect the
location of the missing interval (when the missing weight is sufficiently large) and then estimate 
its missing weight, using a “learning through testing” approach from [DDS14] to keep the sample 
complexity under control. In the second stage, we give a procedure by which the algorithm can use 
its knowledge of the estimated missing interval to correct the distribution by rejection sampling. 
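
To make the missing data error model concrete, the following Python sketch builds the distribution a user observes once every sample falling in [i, j] has been deleted: the surviving mass is simply renormalized. This only illustrates the error model and is not the improver of Section 6; the function names and the example distribution are assumptions made for the illustration.

```python
from fractions import Fraction

def delete_interval(pmf, i, j):
    """pmf observed once every sample falling in [i, j] is deleted.

    pmf[k - 1] is the probability of element k of [n]; i and j are 1-indexed
    and inclusive. Deleting those samples is the same as conditioning the
    distribution on the complement of the interval.
    """
    n = len(pmf)
    if not 1 <= i < j <= n:
        raise ValueError("expected 1 <= i < j <= n")
    kept = [Fraction(0) if i <= k <= j else Fraction(p)
            for k, p in enumerate(pmf, start=1)]
    remaining = sum(kept)
    if remaining == 0:
        raise ValueError("the deleted interval carries all the mass")
    return [p / remaining for p in kept]

def is_non_increasing(pmf):
    """True if pmf(1) >= pmf(2) >= ... >= pmf(n)."""
    return all(a >= b for a, b in zip(pmf, pmf[1:]))

# Hypothetical monotone distribution on [6], weights proportional to 6, 5, ..., 1.
original = [Fraction(w, 21) for w in (6, 5, 4, 3, 2, 1)]
corrupted = delete_interval(original, 3, 4)
print(is_non_increasing(original), is_non_increasing(corrupted))  # True False
```

In this example the deleted interval turns a monotone distribution into one with a "hole", which is the kind of violation the improver has to locate and repair.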

Randomness Scarcity. We then consider the case where only a limited amount of randomness 
(other than the input distribution) is available, and optimizing its use, possibly at the cost of 
worse parameters and/or sample complexity of our sampling improvers, is crucial. This captures 
situations where generating the random bits the algorithm uses is either expensive⁴ (as in the
case of physical implementations relying on devices, such as Geiger counters or Zener diodes) or 
undesirable (e.g., when we want the output distribution to be a deterministic function of the input 
data, for the sake of reproducibility or parallelization). We focus on this setting in Section 7, and 
provide sampling correctors and improvers for uniformity that use samples only from the input 
distribution. For example, we give a sampling improver that, given access to a distribution ε-close to
uniform, grants access to a distribution ε₂-close to uniform and has constant sample
complexity O_{ε,ε₂}(1). We achieve this by exploiting the fact that the uniform distribution is not
only an absorbing element for convolution in Abelian groups, but also an attractive fixed point with 
high convergence rate. That is, by convolving a distribution with itself (i.e., summing independent 
samples modulo the order of the group) one gets very quickly close to uniform. Combining this idea 
with a different type of improvement (based on a von Neumann-type “trick”) allows us to obtain 
an essentially optimal tradeoff between closeness to uniform and to the original distribution. 

⁴On this topic, see for instance the discussion in [KR94, IZ89], and references therein.
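
The contraction behind the convolution idea can be checked numerically. The sketch below is an illustration only (the helper names and the example distribution over Z_5 are assumptions): it computes the distribution of the sum modulo n of two independent samples and compares distances to uniform. Because the uniform distribution U is absorbing, D∗D − U = (D − U)∗(D − U), which gives d_TV(D∗D, U) ≤ 2·d_TV(D, U)².

```python
from fractions import Fraction

def convolve_mod_n(p, q):
    """Distribution of (X + Y) mod n for independent X ~ p and Y ~ q over Z_n."""
    n = len(p)
    out = [Fraction(0)] * n
    for x, px in enumerate(p):
        for y, qy in enumerate(q):
            out[(x + y) % n] += px * qy
    return out

def tv_to_uniform(p):
    """Total variation distance between p and the uniform distribution on Z_n."""
    n = len(p)
    return sum(abs(px - Fraction(1, n)) for px in p) / 2

# Hypothetical distribution over Z_5, with weights 6, 5, 4, 3, 2 out of 20.
d = [Fraction(w, 20) for w in (6, 5, 4, 3, 2)]
d2 = convolve_mod_n(d, d)
print(tv_to_uniform(d), tv_to_uniform(d2))  # 3/20 1/40
```

Here one self-convolution costs two samples of D per output sample and brings the distance to uniform from 3/20 down to 1/40, at the price of moving away from the original distribution.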
1.3 Previous work


Dealing with noisy or incomplete datasets has been a challenge in Statistics and data sciences, 
and many methods have been proposed to handle them. One of the most widely used, multiple 
imputation (one of many variants of the general paradigm of imputation) was first introduced by 
Rubin [Rub87] and consists of the creation of several complete datasets from an incomplete one. 
Specifically, one first obtains these new datasets by filling in the missing values randomly according 
to a maximum likelihood distribution computed from the observations and a modeling assumption 
made on the data. The parameters of this model are then updated using the new datasets and the 
ML distribution is computed again. This resembles the Expectation-Maximization (EM) algorithm, 
which can also be used for similar problems, as e.g. in [DLR77]. After a few iterations, one can 
get both accurate parameter estimates and the right distribution to sample data from. Assuming 
the assumptions chosen to model the data did indeed reflect its true distribution, and that the 
number of these new datasets was large enough, this can be shown to yield statistically accurate 
and unbiased results [Sch97, LR02]. 

From a Theoretical Computer Science perspective, the problem of local correction of data has 
received much attention in the contexts of self-correcting programs, locally correctable codes, and 
local filters for graphs and functions over [n]^d (some examples include [BLR90, Yek10, ACCL08,
SS10, BGJ+12, JR11]). To the best of our knowledge, this is the first work to address the correction
of data from distributions. (We observe that Chakraborty et al. consider in [CGM11] a different
question, although of a similar distributional flavor: namely, given query access to a Boolean
function f: {0,1}^n → {0,1} which is close to a k-junta f*, they show how to approximately
generate uniform PAC-style samples of the form (x, g*(x)) where x ∈ {0,1}^k and g* is the function
underlying f*. They then describe how to apply this “noisy sampler” primitive to test whether a 
function is close to being a junta.) 

In this work, we show that the problem of estimating distances between distributions is related. 
There has been much work on this topic, but we note the following result: [DDS+13] show how to
estimate the total variation distance between k-modal probability distributions.⁵ The authors give
a reduction of their problem into one with logarithmic size, using a result by Birgé on monotone
distributions [Bir87]. In particular, one can partition the domain Ω = [n] into log n/ε intervals in
an oblivious way, such that the “flattening” of any monotone distribution according to that partition
is O(ε)-close to the original one. We use similar ideas in order to obtain some of the results in the
present paper. 

It is instructive to compare the goal of our model of distribution sampling correctors to that of 
extractors: in spite of many similarities, the two have essential differences and the results are in 
many cases incomparable. We defer this discussion to Section 7.1. 

2 Preliminaries 

Hereafter, we write [n] for the set {1, ..., n}, and log for the logarithm in base 2. A probability
distribution over a finite domain Ω is a non-negative function D: Ω → [0,1] such that
$\sum_{x \in \Omega} D(x) = 1$; we denote by U_Ω the uniform distribution on Ω. Moreover, given a
distribution D over Ω and a set S ⊆ Ω, we write D(S) for the total probability weight
$\sum_{x \in S} D(x)$ assigned to S by D.

⁵A probability distribution D is k-modal if there exists a partition of [n] into k intervals such that D is monotone
(increasing or decreasing) on each. 
Previous tools from probability. As previously mentioned, in this work we will be concerned
with the total variation distance between distributions. Of interest for the analysis of some of
our algorithms, and assuming Ω is totally ordered (in our case, Ω = [n]), one can also define the
Kolmogorov distance between D₁ and D₂ as

$$d_{\mathrm{K}}(D_1, D_2) = \max_{x \in \Omega} |F_1(x) - F_2(x)| \qquad (1)$$

where F₁ and F₂ are the respective cumulative distribution functions (cdf) of D₁ and D₂. Thus, the
Kolmogorov distance is the ℓ∞ distance between the cdfs; and $d_{\mathrm{K}}(D_1, D_2) \le d_{\mathrm{TV}}(D_1, D_2) \in [0,1]$.
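
As a quick sanity check of these two distances, the Python sketch below computes the total variation distance in both of its forms (half the ℓ1 distance, and the best distinguishing set) together with the Kolmogorov distance. The function names and the two example distributions on [4] are assumptions chosen for illustration; the example is picked so that the Kolmogorov distance is strictly smaller than the total variation distance.

```python
from fractions import Fraction
from itertools import accumulate

def total_variation(p, q):
    """Half the l1 distance between two pmfs given as equal-length lists."""
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

def total_variation_by_best_set(p, q):
    """max over S of p(S) - q(S), attained at S = {x : p(x) > q(x)}."""
    return sum(max(a - b, 0) for a, b in zip(p, q))

def kolmogorov(p, q):
    """Largest gap between the two cumulative distribution functions."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

half = Fraction(1, 2)
d1 = [half, 0, half, 0]   # mass on the odd points of [4]
d2 = [0, half, 0, half]   # mass on the even points of [4]
print(total_variation(d1, d2), kolmogorov(d1, d2))  # 1 1/2
```

The two distributions have disjoint supports, so their total variation distance is 1, yet their cdfs never differ by more than 1/2.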


We first state the following fact, which guarantees that for any two distributions D₁, D₂,
applying any (possibly randomized) function to both D₁ and D₂ can never increase their total
variation distance:

Fact 2.1 (Data Processing Inequality for Total Variation Distance). Let D₁, D₂ be two distributions
over a domain Ω. Fix any randomized function⁶ F on Ω, and let F(D₁) be the distribution such
that a draw from F(D₁) is obtained by drawing independently x from D₁ and f from F and then
outputting f(x) (likewise for F(D₂)). Then we have

$$d_{\mathrm{TV}}(F(D_1), F(D_2)) \le d_{\mathrm{TV}}(D_1, D_2).$$


Finally, we recall below a fundamental fact from probability theory that will be useful to us, 
the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. Informally, this result says that one can learn 
the cumulative distribution function of a distribution up to an additive error ε in ℓ∞ distance, by
taking only O(1/ε²) samples from it.

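Theorem 2.2 ([DKW56, Mas90]). Let D be a distribution over [n]. Given m independent samples
x₁, ..., x_m from D, define the empirical distribution D̂ as follows:

$$\hat{D}(i) = \frac{|\{ j \in [m] : x_j = i \}|}{m}, \qquad i \in [n].$$

Then, for all ε > 0, $\Pr[ d_{\mathrm{K}}(D, \hat{D}) > \varepsilon ] \le 2e^{-2m\varepsilon^2}$, where the probability is taken over the samples.

In particular, setting $m = \Theta(\log(1/\delta)/\varepsilon^2)$ we get that $d_{\mathrm{K}}(D, \hat{D}) \le \varepsilon$ with probability at least 1 − δ.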
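
The sample size implied by the DKW inequality can be computed explicitly. The following Python sketch assumes the bound in the form 2·exp(−2mε²) ≤ δ (the constant 2 is the one from [Mas90]) and solves it for m; it also builds the empirical cdf from samples. The function names and the example distribution on [4] are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import accumulate

def dkw_sample_size(eps, delta):
    """Smallest m such that 2 * exp(-2 * m * eps**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def empirical_cdf(samples, n):
    """cdf of the empirical distribution of samples taking values in {1..n}."""
    counts = Counter(samples)
    m = len(samples)
    return [c / m for c in accumulate(counts[i] for i in range(1, n + 1))]

def kolmogorov_error(pmf, samples):
    """Kolmogorov distance between pmf and the empirical distribution."""
    true_cdf = accumulate(pmf)
    return max(abs(f - g) for f, g in zip(true_cdf, empirical_cdf(samples, len(pmf))))

# Illustrative run on a hypothetical distribution over [4].
rng = random.Random(0)
pmf = [0.4, 0.3, 0.2, 0.1]
m = dkw_sample_size(eps=0.1, delta=0.05)
samples = rng.choices([1, 2, 3, 4], weights=pmf, k=m)
print(m, kolmogorov_error(pmf, samples))
```

For ε = 0.1 and δ = 0.05 this gives m = 185, independently of the domain size n, which is the feature of the inequality that the later sections rely on.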


Monotone distributions. We say that a distribution D on [n] is monotone (non-increasing)
if its probability mass function is non-increasing, that is if D(1) ≥ ⋯ ≥ D(n). When dealing
with monotone distributions, it will be useful to consider the Birgé decomposition, which is a way
to approximate any monotone distribution D by a histogram, where the latter is supported by
logarithmically many intervals which crucially do not depend on D itself:

Definition 2.3 (Birgé decomposition). Given a parameter α > 0, the corresponding (oblivious)
Birgé decomposition of [n] is the partition $\mathcal{I}_\alpha = (I_1, \dots, I_\ell)$, where $\ell = O(\log n / \alpha)$ and

$$|I_{k+1}| = \lfloor (1+\alpha)^k \rfloor, \qquad 1 \le k < \ell.$$


⁶Which can be seen as a distribution over functions over Ω.
For a distribution D and parameter α, define Φ_α(D) to be the “flattened” distribution with relation
to the oblivious decomposition $\mathcal{I}_\alpha$, as $\Phi_\alpha(D)(i) = D(I_k)/|I_k|$ for all $k \in [\ell]$ and $i \in I_k$. A
straightforward computation (see Appendix C) shows that such flattening can only decrease the
distance between two distributions, i.e.

$$d_{\mathrm{TV}}(\Phi_\alpha(D_1), \Phi_\alpha(D_2)) \le d_{\mathrm{TV}}(D_1, D_2). \qquad (2)$$

The next theorem states that every monotone distribution can be well-approximated by its 
“flattening” on the Birgé decomposition’s intervals:

Theorem 2.4 ([Bir87, DDS+13]). If D is monotone, then $d_{\mathrm{TV}}(D, \Phi_\alpha(D)) \le \alpha$.
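
The decomposition and the flattening operator are easy to experiment with. The Python sketch below is a minimal illustration under stated assumptions: interval sizes follow floor((1+α)^k) starting from k = 0, the last interval is truncated at n, and the example distribution (mass proportional to 1/i on [20]) is hypothetical. It checks the guarantee of Theorem 2.4 on that one example.

```python
import math
from fractions import Fraction

def birge_decomposition(n, alpha):
    """Consecutive intervals (start, end) of [n], 1-indexed and inclusive.

    Assumption of this sketch: the k-th interval (k = 0, 1, ...) has size
    floor((1 + alpha)**k) and the last interval is truncated at n.
    """
    intervals, start, k = [], 1, 0
    while start <= n:
        size = math.floor((1 + alpha) ** k)
        end = min(n, start + size - 1)
        intervals.append((start, end))
        start, k = end + 1, k + 1
    return intervals

def flatten(pmf, intervals):
    """Spread the mass of each interval uniformly over that interval."""
    flat = []
    for a, b in intervals:
        width = b - a + 1
        flat.extend([sum(pmf[a - 1:b]) / width] * width)
    return flat

def total_variation(p, q):
    return sum(abs(x - y) for x, y in zip(p, q)) / 2

# Hypothetical monotone distribution on [20]: D(i) proportional to 1/i.
n, alpha = 20, Fraction(1, 2)
weights = [Fraction(1, i) for i in range(1, n + 1)]
total = sum(weights)
pmf = [w / total for w in weights]
intervals = birge_decomposition(n, alpha)
flat = flatten(pmf, intervals)
print(intervals, float(total_variation(pmf, flat)))
```

Note that `birge_decomposition` never looks at the distribution: the same intervals serve every monotone D, which is what "oblivious" refers to.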

As a corollary, one can extend the theorem to distributions only promised to be close to monotone: 


Corollary 2.5. Suppose D is ε-close to monotone, and let α > 0. Then $d_{\mathrm{TV}}(D, \Phi_\alpha(D)) \le 2\varepsilon + \alpha$.
Furthermore, Φ_α(D) is also ε-close to monotone.

Access to the distributions. While we will mostly be concerned in this work with the standard 
model of access to the probability distributions, where the algorithm is provided with independent 
samples from an unknown distribution D, the concepts we introduce and some of our results apply 
to some other types of access as well. One in particular, the Cumulative Dual access model, grants 
the algorithms the ability to query the value of the cumulative distribution function (cdf) of D, 
in addition to regular sampling.⁷ (We observe, as in [CR14], that this type of query access is for
instance justified when the distribution originates from a sorted dataset, in which case such queries 
can be implemented with only a logarithmic overhead.) 

Unless explicitly specified otherwise, our algorithms only assume standard sampling access; the 
formal definitions of the two models mentioned above can be found in Appendix B. 

3 Our model: definitions 

In this section, we state the precise definitions of sampling correctors, improvers and batch sampling 
improvers. To get an intuition, the reader may think for instance of the parameter ε₁ below as
being 2ε, and the error probability δ as 1/3. Although all definitions are presented in terms of the
total variation distance, analogous definitions in terms of other distances can also be made. 

Definition 3.1 (Sampling Corrector). Fix a given property V of distributions on Ω. An (ε, ε₁)-sampling
corrector for V is a randomized algorithm which is given parameters ε, ε₁ ∈ (0,1] such
that ε₁ ≥ ε and δ ∈ [0,1], as well as sampling access to a distribution D. Under the promise that
$d_{\mathrm{TV}}(D, V) \le \varepsilon$, the algorithm must provide, with probability at least 1 − δ over the samples it
draws and its internal randomness, sampling access to a distribution D̃ such that

(i) D̃ is close to D: $d_{\mathrm{TV}}(\tilde{D}, D) \le \varepsilon_1$;

(ii) D̃ has the property: D̃ ∈ V.

In other words, with high probability the corrector will simulate exactly a sampling oracle for D̃.
The query complexity q = q(ε, ε₁, δ, Ω) of the algorithm is the number of samples from D it takes
per query in the worst case.

⁷See also [Can15] for a summary and comparison of the different existing access models.
One can define a more general notion, which allows the algorithm to only get “closer” to the
desired property, and convert some type of access ORACLE₁ into some other type of access ORACLE₂
(e.g., from sampling to evaluation access):

Definition 3.2 (Sampling Improver (general definition)). Fix a given property V over distributions
on Ω. A sampling improver for V (from ORACLE₁ to ORACLE₂) is a randomized algorithm
which, given parameter ε ∈ (0,1] and ORACLE₁ access to a distribution D with the promise that
$d_{\mathrm{TV}}(D, V) \le \varepsilon$ as well as parameters ε₁, ε₂ ∈ [0,1] satisfying ε₁ + ε₂ ≥ ε, provides, with probability
at least 1 − δ over the answers from ORACLE₁ and its internal randomness, ORACLE₂ access to a
distribution D̃ such that

$$d_{\mathrm{TV}}(\tilde{D}, D) \le \varepsilon_1 \qquad \text{(Close to } D\text{)}$$

$$d_{\mathrm{TV}}(\tilde{D}, V) \le \varepsilon_2 \qquad \text{(Close to } V\text{)}$$

In other words, with high probability the improver will simulate exactly ORACLE₂ access to D̃.
The query complexity q = q(ε, ε₁, ε₂, δ, Ω) of the algorithm is the number of queries it makes to
ORACLE₁ in the worst case.

Finally, one may ask for such an improver to provide many samples from the (same) improved
distribution,⁸ where “many” is a number committed in advance. We refer to such an improver as a
batch sampling improver:

Definition 3.3 (Batch Sampling Improver). For V, D, ε, ε₁, ε₂ ∈ [0,1] as above, and parameter
m ∈ N, a batch sampling improver for V (from ORACLE₁ to ORACLE₂) is a sampling improver
which provides, with probability at least 1 − δ, ORACLE₂ access to D̃ for as many as m queries,
in between which it is allowed to maintain some internal state ensuring consistency. The query
complexity of the algorithm is now allowed to depend on m as well.

Note that, in particular, when providing sampling access to D̃ the batch improver must guarantee
independence of the m samples. When ε₂ is set to 0 in the above definition, we will refer to the
algorithm as a batch sampling corrector.

4 Connections to learning and testing 

In this section, we draw connections between sampling improvers and other areas, namely testing 
and learning. These connections shed light on the relation between our model and these other 
lines of work, and provide a way to derive new algorithms and impossibility results for both testing
and learning problems. (For the formal definition of the testing and learning notions used in this
section, the reader is referred to Appendix B and the relevant subsections.) 

4.1 From learning to correcting 

As a first observation, it is not difficult to see that, under the assumption that the unknown
distribution D belongs to some specific class C, correcting (or improving) a property V requires at
most as many samples as learning the class C; that is, learning (a class of distributions) is at least
as hard as correcting (distributions of this class). Here, V and C need not be related.

⁸Indeed, observe that as sampling correctors and improvers are randomized algorithms with access to their “own”
coins, there is no guarantee that fixing the input distribution D would lead to the same output distribution D̃. This
is particularly important when providing other types of access (e.g., evaluation queries) to D̃ than only sampling.

Indeed, assuming there exists a learning algorithm L for C, it then suffices to run L on the
unknown distribution D ∈ C to learn (with high probability) a hypothesis D̂ such that D and D̂
are at most at distance (ε₁ − ε)/2. In particular, D̂ is at most (ε₁ + ε)/2-far from V. One can then (e.g., by
exhaustive search) find a distribution D̃ in V which is closest to D̂ (and therefore at most ε₁-far
from D), and use it to produce as many “corrected samples” as wanted:


Theorem 4.1. Let C be a class of probability distributions over Ω. Suppose there exists a learning
algorithm L for C with sample complexity q_L. Then, for any property V of distributions, there
exists a (not-necessarily computationally efficient) sampling corrector for V with sample complexity
$q(\varepsilon, \varepsilon_1, \delta) = q_L\left(\frac{\varepsilon_1 - \varepsilon}{2}, \delta\right)$, under the promise that D ∈ C.
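
The correcting-by-learning argument can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions: the learner is the naive empirical estimate (a stand-in for the algorithm assumed by the theorem), the property is given as a small explicit list of candidate distributions so that exhaustive search is possible, and all names and numbers are hypothetical.

```python
import random
from collections import Counter

def total_variation(p, q):
    return sum(abs(a - b) for a, b in zip(p, q)) / 2

def learn_empirical(samples, n):
    """Naive stand-in for the learner: the empirical pmf over {1..n}."""
    counts = Counter(samples)
    return [counts[i] / len(samples) for i in range(1, n + 1)]

def closest_candidate(hypothesis, candidates):
    """Exhaustive search for the candidate pmf closest to the hypothesis."""
    return min(candidates, key=lambda c: total_variation(hypothesis, c))

def corrector_by_learning(draw, n, candidates, num_samples, rng):
    """Learn once, project onto the property, then sample without touching D again."""
    hypothesis = learn_empirical([draw() for _ in range(num_samples)], n)
    corrected = closest_candidate(hypothesis, candidates)
    support = list(range(1, n + 1))

    def draw_corrected():
        # Every corrected sample consumes fresh random bits from rng.
        return rng.choices(support, weights=corrected, k=1)[0]

    return corrected, draw_corrected

# Hypothetical property: a small explicit list of pmfs over [3].
candidates = [[1 / 3, 1 / 3, 1 / 3], [0.5, 0.5, 0.0], [0.6, 0.3, 0.1]]
rng = random.Random(1)
source = lambda: rng.choices([1, 2, 3], weights=[0.55, 0.3, 0.15], k=1)[0]
corrected, draw_corrected = corrector_by_learning(source, 3, candidates, 2000, rng)
print(corrected, [draw_corrected() for _ in range(5)])
```

After the single learning phase, the unknown distribution is never sampled again: all the cost of further corrected samples is paid in extra randomness.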

Furthermore, if the (efficient) learning algorithm L has the additional guarantee that its hypothesis
class is a subset of V (i.e., the hypotheses it produces always belong to V) and that the
hypotheses it contains allow efficient generation of samples, then we immediately obtain a computationally
efficient sampling corrector: indeed, in this case D̂ ∈ V already. Furthermore, as
mentioned in the introduction, when efficient agnostic proper learning algorithms for distribution
classes exist, then there are efficient sampling correctors for the same classes. It is however worth
pointing out that this correcting-by-learning approach is quite inefficient with regard to the amount
of extra randomness needed: indeed, every sample generated from D̃ requires fresh new random
bits.

To illustrate this theorem, we give two easy corollaries. The first follows from Chan et al., who
showed in [CDSS13] that monotone hazard rate distributions can be learned to accuracy $\varepsilon$ using
$\tilde{O}(\log n/\varepsilon^4)$ samples; moreover, the hypothesis obtained is a $O(\log(n/\varepsilon)/\varepsilon^2)$-histogram.


Corollary 4.2. Let $\mathcal{C}$ be the class of monotone hazard rate distributions over $[n]$, and $\mathcal{P}$ be the
property of being a histogram with (at most) $\sqrt{n}$ pieces. Then, under the promise that $D \in \mathcal{C}$ and
as long as $\varepsilon = \tilde{\Omega}(1/\sqrt{n})$, there is a sampling corrector for $\mathcal{P}$ with sample complexity $\tilde{O}\!\left(\frac{\log n}{(\varepsilon_1-\varepsilon)^4}\right)$.

Our next example however demonstrates that this learning approach is not always optimal:


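Corollary 4.3. Let $\mathcal{C}$ be the class of monotone distributions over $[n]$, and $\mathcal{P}$ be the property of
being a histogram with (at most) $\sqrt{n}$ pieces. Then, under the promise that $D \in \mathcal{C}$ and as long as
$\varepsilon = \tilde{\Omega}(1/\sqrt{n})$, there is a sampling corrector for $\mathcal{P}$ with sample complexity $O\!\left(\frac{\log n}{(\varepsilon_1-\varepsilon)^3}\right)$.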

Indeed, for learning monotone distributions $\Theta(\log n/\varepsilon^3)$ samples are known to be necessary and
sufficient [Bir87]. Yet, one can also correct the distribution by simulating samples directly from
its flattening on the corresponding Birgé decomposition (as per Definition 2.3); and every sample
from this correction-by-simulation costs exactly one sample from the original distribution.
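The correction-by-simulation just described can be sketched as follows. This is only an illustration under stated assumptions: Definition 2.3 is not reproduced in this section, so the interval lengths $\lfloor(1+\alpha)^k\rfloor$ used below are an assumed form of the Birgé decomposition, and the names `birge_intervals` and `corrected_sample` are hypothetical.

```python
import bisect
import random

def birge_intervals(n, alpha):
    """Split {1, ..., n} into consecutive intervals (lo, hi) whose lengths grow
    like floor((1 + alpha)^k); the last interval is truncated at n.
    (Assumed form of the decomposition, for illustration only.)"""
    intervals, start, k = [], 1, 0
    while start <= n:
        length = max(1, int((1 + alpha) ** k))
        end = min(n, start + length - 1)
        intervals.append((start, end))
        start, k = end + 1, k + 1
    return intervals

def corrected_sample(draw_from_d, intervals, rng=random):
    """Return one sample from the flattening of D, using exactly one draw from D."""
    x = draw_from_d()
    starts = [lo for lo, _ in intervals]
    # Locate the interval containing x, then output a uniform point inside it.
    lo, hi = intervals[bisect.bisect_right(starts, x) - 1]
    return rng.randint(lo, hi)
```

Each call to `corrected_sample` consumes a single draw from the original distribution, which is the point of the comparison with the learning-based corrector.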


4.2 From correcting to agnostic learning 

Let $\mathcal{C}$ and $\mathcal{H}$ be two classes of probability distributions over $\Omega$. Recall that a (semi-)agnostic
learner for $\mathcal{C}$ (using hypothesis class $\mathcal{H}$) is a learning algorithm $\mathcal{A}$ which, given sample access to
an arbitrary distribution $D$ and parameter $\varepsilon$, outputs a hypothesis $\hat{D} \in \mathcal{H}$ such that, with high
probability, $\hat{D}$ does “about as well as the best approximation from $\mathcal{C}$:”

\[ d_{\mathrm{TV}}(D,\hat{D}) \le c\cdot\mathrm{OPT}_{\mathcal{C},D} + O(\varepsilon) \]

where $\mathrm{OPT}_{\mathcal{C},D} = \inf_{D_{\mathcal{C}}\in\mathcal{C}} d_{\mathrm{TV}}(D_{\mathcal{C}},D)$ and $c \ge 1$ is some absolute constant (if $c = 1$, the learner is
said to be agnostic).

We first describe how to combine a (non-agnostic) learning algorithm with a sampling corrector
in order to obtain an agnostic learner, under the strong assumption that a (rough) estimate of $\mathrm{OPT}_{\mathcal{C},D}$
is known. Then, we explain how to get rid of this extra requirement, using machinery from the
distribution learning literature (namely, an efficient hypothesis selection procedure).

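Theorem 4.4. Let $\mathcal{C}$ be as above. Suppose there exists a learning algorithm $\mathcal{L}$ for $\mathcal{C}$ with sample
complexity $q_{\mathcal{L}}$, and a batch sampling corrector $\mathcal{A}$ for $\mathcal{C}$ with sample complexity $q_{\mathcal{A}}$. Suppose further
that a constant-factor estimate $\widehat{\mathrm{OPT}}$ of $\mathrm{OPT}_{\mathcal{C},D}$ is known (up to a multiplicative $c$).

Then, there exists an agnostic learner for $\mathcal{C}$ with sample complexity
$q(\varepsilon,\delta) = q_{\mathcal{A}}\!\left(\widehat{\mathrm{OPT}}, \widehat{\mathrm{OPT}}+\varepsilon, q_{\mathcal{L}}\!\left(\varepsilon,\frac{\delta}{2}\right), \frac{\delta}{2}\right)$ (where the constant in front of $\mathrm{OPT}_{\mathcal{C},D}$ is $c$).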

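Proof. Let $c$ be the constant such that $\mathrm{OPT}_{\mathcal{C},D} \le \widehat{\mathrm{OPT}} \le c\cdot\mathrm{OPT}_{\mathcal{C},D}$. The agnostic learner $\mathcal{L}'$ for $\mathcal{C}$, on
input $\varepsilon \in (0,1]$, works as follows:

- Run $\mathcal{A}$ on $D$ with parameters $(\widehat{\mathrm{OPT}}, \widehat{\mathrm{OPT}}+\varepsilon, \frac{\delta}{2})$ to get $q_{\mathcal{L}}(\varepsilon,\frac{\delta}{2})$ samples distributed according
to some distribution $\tilde{D}$.

- Run $\mathcal{L}$ on these samples, with parameters $\varepsilon, \frac{\delta}{2}$, and output its hypothesis $\hat{D}$.

We hereafter condition on both algorithms succeeding (which, by a union bound, happens with
probability at least $1-\delta$). Since $D$ is $\widehat{\mathrm{OPT}}$-close to $\mathcal{C}$, by correctness of the sampling
corrector we have both $\tilde{D} \in \mathcal{C}$ and $d_{\mathrm{TV}}(D,\tilde{D}) \le \widehat{\mathrm{OPT}}+\varepsilon$. Hence, the output $\hat{D}$ of the learning
algorithm satisfies $d_{\mathrm{TV}}(\tilde{D},\hat{D}) \le \varepsilon$, which implies

\[ d_{\mathrm{TV}}(D,\hat{D}) \le \widehat{\mathrm{OPT}} + 2\varepsilon \le c\cdot\mathrm{OPT}_{\mathcal{C},D} + 2\varepsilon \qquad (3) \]

for some absolute constant $c$, as claimed (using the assumption on $\widehat{\mathrm{OPT}}$). □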
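The two-step reduction in this proof can be written as a short wiring of black boxes. The sketch below is only illustrative: the callables `corrector`, `learner`, and `learner_sample_count` are hypothetical stand-ins for the batch corrector, the learning algorithm, and its sample complexity, and their signatures are assumptions rather than an interface defined in the paper.

```python
def agnostic_learn(corrector, learner, learner_sample_count, opt_hat, eps, delta):
    """Combine a batch sampling corrector with a (non-agnostic) learner.

    Assumed (hypothetical) interfaces:
      corrector(eps, eps1, m, delta)   -> list of m corrected samples
      learner(samples, eps, delta)     -> hypothesis
      learner_sample_count(eps, delta) -> number of samples the learner needs
    """
    # Number of corrected samples the learner will ask for.
    m = learner_sample_count(eps, delta / 2)
    # Step 1: correct, with closeness parameters (opt_hat, opt_hat + eps).
    corrected = corrector(opt_hat, opt_hat + eps, m, delta / 2)
    # Step 2: learn from the corrected samples only.
    return learner(corrected, eps, delta / 2)
```

Splitting the failure probability evenly between the two calls mirrors the union bound used in the proof.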

It is worth noting that in the case the learning algorithm is proper (meaning the hypotheses it
outputs belong to the target class $\mathcal{C}$: that is, $\mathcal{H} \subseteq \mathcal{C}$), then so is the agnostic learner obtained with
Theorem 4.4. This turns out to be a very strong guarantee: specifically, getting (computationally
efficient) proper agnostic learning algorithms remains a challenge for many classes of interest; see
e.g. [DDS12], which mentions efficient proper learning of Poisson Binomial Distributions as an
open problem.

We stress that the above can be viewed as a generic framework to obtain efficient agnostic
learning results from known efficient learning algorithms. For the sake of illustration, let us
consider the simple case of Binomial distributions: it is known, for instance as a consequence of the
aforementioned results on PBDs, that learning such distributions can be performed with $\tilde{O}(1/\varepsilon^2)$
samples (and that $\Omega(1/\varepsilon^2)$ are required). Our theorem then provides a simple way to obtain
agnostic learning of Binomial distributions with sample complexity $\tilde{O}(1/\varepsilon^2)$: namely, by designing
an efficient sampling corrector for this class with sample complexity $\mathrm{polylog}(1/\varepsilon)$.

Corollary 4.5. Suppose there exists a batch sampling corrector $\mathcal{A}$ for the class $\mathcal{B}$ of Binomial
distributions over $[n]$, with sample complexity $q_{\mathcal{A}}(\varepsilon,\varepsilon_1,m,\delta) = \mathrm{polylog}(\frac{1}{\varepsilon}, m, \frac{1}{\delta})$. Then, there
exists a semi-agnostic learner for $\mathcal{B}$, which, given access to an unknown distribution $D$ promised to
be $\varepsilon$-close to some Binomial distribution, takes $\tilde{O}(1/\varepsilon^2)$ samples from $D$ and outputs a distribution
$\hat{B} \in \mathcal{B}$ such that

\[ d_{\mathrm{TV}}(D,\hat{B}) \le 3\varepsilon \]

with probability at least 2/3.

To the best of our knowledge, an agnostic learning algorithm for the class of Binomial distributions
with sample complexity $\tilde{O}(1/\varepsilon^2)$ is not explicitly known, although the results of [CDSS14] do imply
a $\tilde{O}(1/\varepsilon^3)$ upper bound and a modification of [DDS12] (to make their algorithm agnostic) seems
to yield one. The above suggests an approach which would lead to the (essentially optimal) sample
complexity.


4.2.1 Removing the assumption on knowing opt 

In the absence of such an estimate $\widehat{\mathrm{OPT}}$ within a constant factor of $\mathrm{OPT}_{\mathcal{C},D}$ given as input, one
can apply the following strategy, inspired by [CDSX14, Theorem 6]. In the first stage, we
repeatedly try to “guess” a good $\widehat{\mathrm{OPT}}$, and run the agnostic learner of Theorem 4.4 with this value to obtain
a hypothesis. After this stage, we have generated a succinct list $\mathcal{H}$ of hypotheses, one for each $\widehat{\mathrm{OPT}}$
that we tried: the second stage is then to run a hypothesis selection procedure to pick the best
$h \in \mathcal{H}$: as long as one of the guesses was good, this $h$ will be an accurate hypothesis.

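More precisely, suppose we run the agnostic learner of Theorem 4.4 a total of $\log(1/\varepsilon)$ times,
setting at the $k$-th iteration $\widehat{\mathrm{OPT}}_k := 2^k\varepsilon$ and $\delta' := \delta/(2\log(1/\varepsilon))$. For the first $k$ such that
$2^{k-1}\varepsilon < \mathrm{OPT}_{\mathcal{C},D} \le 2^k\varepsilon$, $\widehat{\mathrm{OPT}}_k$ is in $[\mathrm{OPT}_{\mathcal{C},D}, 2\cdot\mathrm{OPT}_{\mathcal{C},D}]$. Therefore, by a union bound on all runs of the
learner at least one of the hypotheses $\hat{D}_k$ will have the agnostic learning guarantee we want to
achieve; i.e. will satisfy (3), with $c = 2$.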
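To make the doubling schedule concrete, here is a small sketch of the guesses $\widehat{\mathrm{OPT}}_k = 2^k\varepsilon$ together with the per-run failure probability $\delta'$. It assumes logarithms in base 2 and rounds the number of rounds up to an integer; the function names are hypothetical.

```python
import math

def opt_guesses(eps, delta):
    """Return the list of (opt_k, delta') pairs with opt_k = 2^k * eps."""
    rounds = max(1, math.ceil(math.log2(1 / eps)))
    delta_prime = delta / (2 * rounds)
    return [(2 ** k * eps, delta_prime) for k in range(1, rounds + 1)]

def first_good_guess(opt, eps, delta):
    """First guess that upper-bounds opt; it is within a factor 2 of opt
    whenever opt > eps (smaller values of opt are absorbed in the O(eps) term)."""
    for guess, _ in opt_guesses(eps, delta):
        if opt <= guess:
            return guess
    return None
```

For instance, with $\varepsilon = 1/8$ the guesses are $1/4$, $1/2$, and $1$, so a true value of $0.3$ is first covered by the guess $1/2$, which is at most twice the true value.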

Conditioned on this being the case, it remains to determine which hypothesis achieves the
guarantee of being $(2\,\mathrm{OPT}_{\mathcal{C},D}+O(\varepsilon))$-close to the distribution $D$. This is where we apply a hypothesis
selection algorithm (a variant of the similar “tournament” procedures from [DL01, DK14, AJOS14]) to
our $N = \log(1/\varepsilon)$ candidates, with accuracy parameter $\varepsilon$ and failure probability $\delta/2$. This algorithm
has the following guarantee:


Proposition 4.6 ([Kam15]). There exists a procedure Tournament that, given sample access to
an unknown distribution $D$ and both sample and evaluation access to $N$ hypotheses $H_1,\dots,H_N$, has
the following behavior. Tournament makes a total of $O(\log(N/\delta)/\varepsilon^2)$ queries to $D, H_1,\dots,H_N$,
runs in time $O(N\log(N/\delta)/\varepsilon^2)$, and outputs a hypothesis $H_i$ such that, with probability at least
$1-\delta$,

\[ d_{\mathrm{TV}}(D,H_i) \le 9.1 \min_{j\in[N]} d_{\mathrm{TV}}(D,H_j) + O(\varepsilon). \]


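Summary. Using this result in the approach outlined above, we get that, with probability at least
$1-\delta$, we obtain a hypothesis $\hat{D}_{k^*}$ doing “almost as well as the best $\hat{D}_k$”; that is,

\[ d_{\mathrm{TV}}(D,\hat{D}_{k^*}) \le 18.2\cdot\mathrm{OPT}_{\mathcal{C},D} + O(\varepsilon). \]

The overall sample complexity is

\[ \sum_{k=1}^{\log(1/\varepsilon)} q_{\mathcal{A}}\!\left(2^k\varepsilon, (2^k+1)\varepsilon, q_{\mathcal{L}}\!\left(\varepsilon,\frac{\delta}{4\log(1/\varepsilon)}\right), \frac{\delta}{4\log(1/\varepsilon)}\right) + O\!\left(\frac{1}{\varepsilon^2}\log\frac{\log(1/\varepsilon)}{\delta}\right) \]

where the first term comes from the $\log(1/\varepsilon)$ runs of the learner from Theorem 4.4, and the second
is the overhead due to the hypothesis selection tournament (Proposition 4.6 with $N = \log(1/\varepsilon)$).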


4.3 From correcting to tolerant testing 

We observe that the existence of sampling correctors for a given property $\mathcal{P}$, along with an
efficient distance estimation procedure, allows one to convert any distribution testing algorithm into
a tolerant distribution testing algorithm. This is similar to the connection between “local
reconstructors” and tolerant testing of graphs described in [Bra08, Theorem 3.1] and [CGR12, Theorem
3.1]. That is, if a property $\mathcal{P}$ has both a distance estimator and a sampling corrector, then one can
perform tolerant testing of $\mathcal{P}$ in the time required to generate enough corrected samples for both
the estimator and a (non-tolerant) tester.

We first state our theorem in all generality, before instantiating it in several corollaries. For the
sake of clarity, the reader may wish to focus on the corollaries on a first pass.


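Theorem 4.7. Let $\mathcal{C}$ be a class of distributions, and $\mathcal{P} \subseteq \mathcal{C}$ a property. Suppose there exists an
$(\varepsilon,\varepsilon_1)$-batch sampling corrector $\mathcal{A}$ for $\mathcal{P}$ with complexity $q_{\mathcal{A}}$ and a distance estimator $\mathcal{E}$ for $\mathcal{C}$ with
complexity $q_{\mathcal{E}}$; that is, given sample access to $D_1, D_2 \in \mathcal{C}$ and parameters $\varepsilon, \delta$, $\mathcal{E}$ draws $q_{\mathcal{E}}(\varepsilon,\delta)$
samples from $D_1, D_2$ and outputs a value $\hat{d}$ such that $|\hat{d} - d_{\mathrm{TV}}(D_1,D_2)| \le \varepsilon$ with probability at least
$1-\delta$.

Then, from any property tester $\mathcal{T}$ for $\mathcal{P}$ with sample complexity $q_{\mathcal{T}}$, one can get a tolerant tester
$\mathcal{T}'$ with query complexity

\[ q(\varepsilon',\varepsilon,\delta) = q_{\mathcal{A}}\!\left(\varepsilon', \Theta(\varepsilon), q_{\mathcal{E}}\!\left(\frac{\varepsilon-\varepsilon'}{4},\frac{\delta}{3}\right) + q_{\mathcal{T}}\!\left(\frac{\varepsilon-\varepsilon'}{4},\frac{\delta}{3}\right), \frac{\delta}{3}\right). \]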

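Proof. The tolerant tester $\mathcal{T}'$ for $\mathcal{P}$, on input $0 \le \varepsilon' < \varepsilon \le 1$, works as follows, setting $\beta := \frac{\varepsilon-\varepsilon'}{4}$
and $\varepsilon_1 := \varepsilon' + \beta$:

- Run $\mathcal{A}$ on $D$ with parameters $(\varepsilon',\varepsilon_1,\delta/3)$ to get $q_{\mathcal{E}}(\beta,\delta/3) + q_{\mathcal{T}}(\beta,\delta/3)$ samples distributed
according to some distribution $\tilde{D}$. Using these samples:

1. Estimate $d_{\mathrm{TV}}(D,\tilde{D})$ to within an additive $\beta$, and REJECT if this estimate is more than
$\varepsilon_1+\beta = \frac{\varepsilon+\varepsilon'}{2}$.

2. Otherwise, run $\mathcal{T}$ on $\tilde{D}$ with parameter $\beta$ and accept if and only if $\mathcal{T}$ outputs ACCEPT.

We hereafter condition on all 3 algorithms succeeding (which, by a union bound, happens with
probability at least $1-\delta$).

If $D$ is $\varepsilon'$-close to $\mathcal{P}$, then the corrector ensures that $\tilde{D}$ is $\varepsilon_1$-close to $D$, so the estimate of $d_{\mathrm{TV}}(D,\tilde{D})$
is at most $\varepsilon_1+\beta$: Step 1 thus passes, and as $\tilde{D} \in \mathcal{P}$ the tester outputs ACCEPT in Step 2.

On the other hand, if $D$ is $\varepsilon$-far from $\mathcal{P}$, then either (a) $d_{\mathrm{TV}}(D,\tilde{D}) > \varepsilon_1+2\beta$ (in which case we
output REJECT in Step 1, since the estimate exceeds $\varepsilon_1+\beta$), or (b) $d_{\mathrm{TV}}(\tilde{D},\mathcal{P}) > \varepsilon-(\varepsilon_1+2\beta) = \beta$,
in which case $\mathcal{T}$ outputs REJECT in Step 2. □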
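The tester built in this proof can be sketched as a composition of three black boxes. The code below is a minimal illustration only: every callable and its signature (`corrector`, `estimator`, `tester`, `sample_d`, `q_est`, `q_test`) is a hypothetical stand-in, and splitting the corrected samples between the two steps is one possible reading of "using these samples".

```python
def tolerant_test(corrector, estimator, tester, sample_d, q_est, q_test,
                  eps_close, eps_far, delta):
    """Tolerant tester from a corrector, a distance estimator and a tester.

    Assumed (hypothetical) interfaces:
      corrector(eps, eps1, m, delta)              -> m corrected samples
      estimator(samples_d, samples_c, beta, delta) -> distance estimate
      tester(samples_c, beta, delta)              -> True to accept
      sample_d(m)                                 -> m raw samples from D
      q_est(beta, delta), q_test(beta, delta)     -> sample counts
    """
    beta = (eps_far - eps_close) / 4
    eps1 = eps_close + beta
    m_est, m_test = q_est(beta, delta / 3), q_test(beta, delta / 3)
    corrected = corrector(eps_close, eps1, m_est + m_test, delta / 3)
    # Step 1: reject if D and the corrected distribution look too far apart.
    estimate = estimator(sample_d(m_est), corrected[:m_est], beta, delta / 3)
    if estimate > eps1 + beta:
        return False
    # Step 2: run the non-tolerant tester on the remaining corrected samples.
    return tester(corrected[m_est:], beta, delta / 3)
```

The threshold `eps1 + beta` equals the midpoint $(\varepsilon+\varepsilon')/2$ of the two distance parameters, which is what leaves a margin of $\beta$ on each side for the estimation error.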


Remark 4.8. Only asking that the distance estimation procedure $\mathcal{E}$ be specific to the class $\mathcal{C}$ is
not innocuous; indeed, it is known ([VV11]) that for general distributions, distance estimation has
sample complexity $\Theta(n/\log n)$. However, the task becomes significantly easier for certain classes of
distributions: for instance, it can be performed with only $O(k\log n)$ samples, if the distributions
are guaranteed to be $k$-modal [DDS+13]. This observation can be leveraged in cases when one
knows that the distribution has a specific property, but does not quite satisfy a second property:
e.g. is known to be $k$-modal but not known to be, say, log-concave.

The reduction above can be useful both as a black-box way to derive upper bounds for tolerant
testing, as well as to prove lower bounds for either testing or distance estimation. For the first
use, we give two applications of our theorem to provide tolerant monotonicity testers for $k$-modal
distributions. The first is a conditional result, showing that the existence of good monotonicity
correctors yields tolerant testers. The second, while unconditional, only guarantees a weaker form
of tolerance (guaranteeing acceptance only of distributions that are very close to monotone); and
relies on a corrector we describe in Section 5.2. As we detail shortly after stating these two results,
even this weak tolerance improves upon the one provided by currently known testing algorithms.

Corollary 4.9. Suppose there exists an $(\varepsilon,\varepsilon_1)$-batch sampling corrector for monotonicity with
complexity $q$. Then, for any $k = O(\log n/\log\log n)$, there exists an algorithm that distinguishes whether
a $k$-modal distribution is (a) $\varepsilon$-close to monotone or (b) $5\varepsilon$-far from monotone with success
probability 2/3, and sample complexity

\[ q\!\left(\varepsilon, 2\varepsilon, C\,\frac{k\log n}{\varepsilon^4\log\log n}, \frac{1}{9}\right) \]

where $C$ is an absolute constant.


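Proof. We combine the distance estimator of [DDS+13] with the monotonicity tester of [DDS14,
Section 3.4], which both apply to the class of $k$-modal distributions. As their respective sample
complexities are, for distance parameter $\alpha$ and failure probability $\delta$,
$O\!\left(\left(\frac{k^2}{\alpha^4} + \frac{k\log n}{\alpha^4\log(k\log n)}\right)\log\frac{1}{\delta}\right)$ and
$O\!\left(\frac{k}{\alpha^2}\log\frac{1}{\delta}\right)$, the choice of parameters ($\delta = 1/3$, $\varepsilon$ and $5\varepsilon$) and the assumption on $k$ yield

\[ O\!\left(\frac{1}{\varepsilon^4}\left(k^2 + \frac{k\log n}{\log(k\log n)}\right)\right) = O\!\left(\frac{k\log n}{\varepsilon^4\log(k\log n)}\right) \]

and we obtain by Theorem 4.7 a tolerant tester with sample complexity $q\!\left(\varepsilon, 2\varepsilon, O\!\left(\frac{k\log n}{\varepsilon^4\log(k\log n)}\right), \frac{1}{9}\right)$,
as claimed. □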

Another application of this theorem, but this time taking advantage of a result from Section 5.2,
allows us to derive an explicit tolerant tester for monotonicity of $k$-modal distributions:


Corollary 4.10. For any $k \ge 1$, there exists an algorithm that distinguishes whether a $k$-modal
distribution is (a) $O(\varepsilon^3/\log^2 n)$-close to monotone or (b) $\varepsilon$-far from monotone with success probability
2/3, and sample complexity

\[ O\!\left(\frac{1}{\varepsilon^4}\left(k^2 + \frac{k\log n}{\log(k\log n)}\right)\right). \]

In particular, for $k = O(\log n/\log\log n)$ this yields a (weakly) tolerant tester with sample complexity

\[ O\!\left(\frac{1}{\varepsilon^4}\,\frac{k\log n}{\log\log n}\right). \]

Proof. We again use the distance estimator of [DDS+13] and the monotonicity tester of [DDS14],
which both apply to the class of $k$-modal distributions, this time with the monotonicity corrector
we describe in Corollary 5.5, which works for any $\varepsilon_1$ and $\varepsilon = O(\varepsilon_1^3/\log^2 n)$ and has constant-rate
sample complexity (that is, it takes $O(q)$ samples from the original distribution to output
$q$ samples). Similarly to Corollary 4.9, the sample complexity is a straightforward application of
Theorem 4.7. □

Note that, to the best of our knowledge, no tolerant tester for monotonicity of $k$-modal
distributions is known, though using the (regular) $O(k/\varepsilon^2)$-sample tester of [DDS14] and standard
arguments, one can achieve a weak tolerance on the order of $O(\varepsilon^2/k)$. While the sample complexity
obtained in Corollary 4.10 is worse by a $\mathrm{polylog}(n)$ factor, it has better tolerance for $k = \Omega(\log^2 n/\varepsilon)$.


5 Sample complexity of correcting monotonicity 

In this section, we focus on the sample complexity aspect of correcting, considering the specific
example of monotonicity correction. As a first result, we show in Section 5.1 how to design a
simple batch corrector for monotonicity which, after a preprocessing step costing logarithmically
many samples, is able to answer an arbitrary number of queries. This corrector follows the “learning
approach” described in Section 4.1, and in particular provides a very efficient way to amortize the
cost of making many queries to a corrected distribution.

A natural question is then whether one can “beat” this approach, and correct the distribution
without approximating it as a whole beforehand. Section 5.2 answers it in the affirmative: namely,
we show that one can correct distributions that are guaranteed to be $(1/\log^2 n)$-close to monotone
in a completely oblivious fashion, with a non-adaptive approach that does not require learning
anything about the distribution.

Finally, we give in Section 5.3 a corrector for monotonicity with no restriction on the range of
parameters, but assuming a stronger type of query access to the original distribution. Specifically,
our algorithm leverages the ability to make cdf queries to the distribution $D$, in order to generate
independent samples from a corrected $\tilde{D}$. This sampling corrector also outperforms the one from
Section 5.1, making only $O(\sqrt{\log n})$ queries per sample in expectation.

5.1 A natural approach: correcting by learning 

Our first corrector works in a straightforward fashion: it learns a good approximation of the 
distribution to correct, which is also concisely represented. It then uses this approximation to 
build a sufficiently good monotone distribution M' “offline,” by searching for the closest monotone 
distribution, which in this case can be achieved via linear programming. Any query made to the 
corrector is then answered according to the latter distribution, at no additional cost. 

Lemma 5.1 (Correcting by learning). Fix any constant $c > 0$. For any $\varepsilon$, $\varepsilon_1 \ge (3+c)\varepsilon$ and
$\varepsilon_2 = 0$ as in the definition, any type of oracle ORACLE and any number of queries $m$, there exists a
sampling corrector for monotonicity from sampling to ORACLE with sample complexity $O(\log n/\varepsilon^3)$.

Proof. Consider the Birgé decomposition $\mathcal{I}_\alpha = (I_1,\dots,I_\ell)$ with parameter $\alpha := \frac{c\varepsilon}{3}$, which partitions
the domain $[n]$ into $\ell = O(\log n/\alpha)$ intervals. By Corollary 2.5 and the learning result of [Bir87], we can
learn with $O(\log n/\alpha^3)$ samples an $\ell$-histogram $D'$ that is $\alpha$-close to the flattening $\bar{D} := \Phi_\alpha(D)$, where

\[ d_{\mathrm{TV}}(D,\bar{D}) \le 2\varepsilon+\alpha. \qquad (4) \]

Also, let $M$ be the closest monotone distribution to $D$. From Eq. (2), we get the following: letting
$\mathcal{M}$ denote the set of monotone distributions,

\[ d_{\mathrm{TV}}(\bar{D},\mathcal{M}) = d_{\mathrm{TV}}(\Phi_\alpha(D),\mathcal{M}) \le d_{\mathrm{TV}}(\Phi_\alpha(D),\Phi_\alpha(M)) \le d_{\mathrm{TV}}(D,M) \le \varepsilon \qquad (5) \]

where the first inequality follows from the fact that $\Phi_\alpha(M)$ is monotone. Thus, $\bar{D}$ is $\varepsilon$-close to
monotone, which implies that $D'$ is $(\varepsilon+\alpha)$-close to monotone. Furthermore, it is easy to see that,
without loss of generality, one can assume the closest monotone distribution to $D'$ to be piecewise
constant with relation to the same partition (e.g., using again Eq. (2)). It is therefore sufficient to
find such a piecewise constant distribution: to do so, consider the following linear program which
finds exactly this: a monotone $M'$, closest to $D'$ and piecewise constant on $\mathcal{I}_\alpha$:

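\[
\begin{aligned}
\text{minimize}\quad & \sum_{j=1}^{\ell} \left| x_j - \frac{D'(I_j)}{|I_j|} \right| \cdot |I_j| \\
\text{subject to}\quad & 1 \ge x_1 \ge x_2 \ge \dots \ge x_\ell \ge 0, \\
& \sum_{j=1}^{\ell} x_j\,|I_j| = 1.
\end{aligned}
\]

This linear program has $O(\log n/\varepsilon)$ variables and so it can be solved in time $\mathrm{poly}(\log n, 1/\varepsilon)$.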

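After finding a solution $(x_j)_{j\in[\ell]}$ to this linear program,^9 we define the distribution $M'\colon [n] \to [0,1]$
as follows: $M'(i) = x_{\mathrm{ind}(i)}$, where $\mathrm{ind}(i)$ is the index of the interval of $\mathcal{I}_\alpha$ which $i$ belongs to.
This implies that

\[ d_{\mathrm{TV}}(D',M') \le \varepsilon+\alpha \]

and by the triangle inequality we finally get:

\[ d_{\mathrm{TV}}(D,M') \le d_{\mathrm{TV}}(D,\bar{D}) + d_{\mathrm{TV}}(\bar{D},D') + d_{\mathrm{TV}}(D',M') \le 3\varepsilon+3\alpha = (3+c)\varepsilon. \]

□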


5.2 Oblivious correcting of distributions which are very close to monotone 

We now turn to our second monotonicity corrector, which achieves constant sample complexity for
distributions already $(1/\log^2 n)$-close to monotone. Note that while this is a very strong assumption
(as samples from such a distribution $D$ are essentially indistinguishable from samples originating
from its closest monotone distribution, unless $\Omega(\log^2 n)$ of them are taken already), our construction
actually yields a stronger guarantee: namely, given evaluation (query) access to $D$, it can answer
evaluation queries to the corrected distribution as well. (See Remark 5.6 for a more detailed
statement.)

The high-level idea is to treat the distribution as a $k$-histogram on the Birgé decomposition
(for $k = O(\log n)$), thus “implicitly approximating” it; and to correct this histogram by adding a
certain amount of probability weight to every interval, so that each gets slightly more than the
next one. By choosing these quantities carefully, this ensures that any violation of monotonicity
gets corrected in the process, without ever having to find out where they actually occur.

^9 To see why a good solution always exists, consider the closest monotone distribution to $D'$, and apply $\Phi_\alpha$ to it.
This distribution satisfies all the constraints.

We start by stating the general correcting approach for general $k$-histograms satisfying a certain
property (namely, the ratio between two consecutive intervals is constant).

Lemma 5.2. Let $\mathcal{I} = (I_1,\dots,I_k)$ be a decomposition of $[n]$ in consecutive intervals such that
$|I_{j+1}|/|I_j| = 1+c$ for all $j$, and $D$ be a $k$-histogram distribution on $\mathcal{I}$ that is $\varepsilon$-close to monotone.
Then, there is a monotone distribution $\tilde{D}$ which can be sampled from in constant time given oracle
access to $D$, such that $d_{\mathrm{TV}}(D,\tilde{D}) = O(\varepsilon k^2)$. Further, $\tilde{D}$ is also a $k$-histogram distribution on $\mathcal{I}$.

Proof. We will argue that no interval can have significantly more total weight than the previous
one, as it would otherwise contradict the bound on the closeness to monotonicity. This bound on
the “jump” between two consecutive intervals enables us to define a new distribution $\tilde{D}$ which is a
mixture of $D$ with an arithmetically decreasing $k$-histogram (which only depends on $\varepsilon$ and $k$); it
can be shown that for the proper choice of parameters, $\tilde{D}$ is now monotone.

We start with the following claim, which leverages the distance to monotonicity in order to give 
a bound on the total violation between two consecutive intervals of the partition: 

Claim 5.3. Let $D$ be a $k$-histogram distribution on $\mathcal{I}$ that is $\varepsilon$-close to monotone. Then, for any
$j \in \{1,\dots,k-1\}$,

\[ D(I_{j+1}) \le (1+c)\,D(I_j) + \varepsilon(2+c). \qquad (6) \]

Proof. First, observe that without loss of generality, one can assume the monotone distribution
closest to $D$ to be a $k$-histogram on $\mathcal{I}$ as well (e.g., by a direct application of Fact 2.1 to the flattening
on $\mathcal{I}$ of the monotone distribution closest to $D$). Assume there exists an index $j \in \{1,\dots,k-1\}$
contradicting (6); then,

\[ \frac{D(I_{j+1})}{|I_{j+1}|} > \frac{D(I_j)}{|I_j|} + \frac{\varepsilon(2+c)}{|I_{j+1}|}. \]

But any monotone distribution $M$ which is a $k$-histogram on $\mathcal{I}$ must satisfy $\frac{M(I_{j+1})}{|I_{j+1}|} \le \frac{M(I_j)}{|I_j|}$, so
that at least $\varepsilon(2+c)$ total weight has to be “redistributed” to fix this violation. Indeed, it is not
hard to see^10 that the minimum amount of probability weight to “move” in order to do so is at least
what is needed to uniformize $D$ on $I_j$ and $I_{j+1}$. This latter process yields a distribution $D'$ which
puts weight $(D(I_j)+D(I_{j+1}))/((2+c)|I_j|)$ on each element of $I_j \cup I_{j+1}$, and the total variation
distance between $D$ and $D'$ (a lower bound on its distance to monotonicity) is then

\[ d_{\mathrm{TV}}(D,D') = D(I_{j+1}) - \frac{(1+c)\left(D(I_{j+1})+D(I_j)\right)}{2+c} = \frac{D(I_{j+1}) - (1+c)D(I_j)}{2+c} > \frac{\varepsilon(2+c)}{2+c} = \varepsilon \]

which is a contradiction. □

^10 E.g., by writing the $\ell_1$ cost as the sum of the weight added/removed from “outside” the two buckets and the
weight moved between the two buckets in order to satisfy the monotonicity condition, and minimizing this function.

This suggests immediately the following correcting scheme: to output samples according to $\tilde{D}$, a
$k$-histogram on $\mathcal{I}$ defined by

\[
\begin{aligned}
\tilde{D}(I_k) &= \lambda\,D(I_k) \\
\tilde{D}(I_{k-1}) &= \lambda\left(D(I_{k-1}) + (2+c)\varepsilon\right) \\
&\;\;\vdots \\
\tilde{D}(I_{k-j}) &= \lambda\left(D(I_{k-j}) + j(2+c)\varepsilon\right)
\end{aligned}
\]

that is

\[ \tilde{D}(I_j) = \lambda\left(D(I_j) + \varepsilon\sum_{i=j}^{k-1}(2+c)\right), \qquad 1 \le j \le k, \]

where the normalizing factor is $\lambda := \left(1+\varepsilon(2+c)\frac{k(k-1)}{2}\right)^{-1}$. Since, by Claim 5.3, adding weight
decreasing by $(2+c)\varepsilon$ at each step fixes any pair of adjacent intervals whose average weights are
not monotone, $\tilde{D}/\lambda$ is a non-increasing non-negative function. The normalization by $\lambda$ preserving
the monotonicity, $\tilde{D}$ is indeed a monotone distribution, as claimed.

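It only remains to bound $d_{\mathrm{TV}}(D,\tilde{D})$:

\[
\begin{aligned}
2\,d_{\mathrm{TV}}(D,\tilde{D}) &= \sum_{j=1}^{k} \left| D(I_j) - \tilde{D}(I_j) \right| = \sum_{j=1}^{k} \left| (1-\lambda)D(I_j) - \lambda\varepsilon\sum_{i=j}^{k-1}(2+c) \right| \\
&\le (1-\lambda)\sum_{j=1}^{k} D(I_j) + \lambda\varepsilon\sum_{j=1}^{k}\sum_{i=j}^{k-1}(2+c) = 1 - \frac{1-\varepsilon(2+c)\frac{k(k-1)}{2}}{1+\varepsilon(2+c)\frac{k(k-1)}{2}} = O(\varepsilon k^2).
\end{aligned}
\]

Finally, note that $\tilde{D}$ is a mixture of $D$ (with weight $\lambda$) and an explicit arithmetically non-increasing
distribution; sampling from $\tilde{D}$ is thus straightforward, and needs at most one sample from $D$ for
each draw. □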
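As a small numerical check of this scheme, the sketch below computes the corrected interval weights $\tilde{D}(I_j) = \lambda\,(D(I_j) + (k-j)(2+c)\varepsilon)$ with $\lambda = (1+\varepsilon(2+c)k(k-1)/2)^{-1}$, using exact rationals. The function names are hypothetical, and the code assumes that `eps` is a valid upper bound on the distance of the input histogram to monotone.

```python
from fractions import Fraction

def correct_histogram(weights, c, eps):
    """weights[j] is D(I_{j+1}) (0-indexed list); returns the corrected
    interval weights. Interval I_j receives (k - j) * (2 + c) * eps extra weight."""
    k = len(weights)
    step = (2 + c) * eps
    lam = 1 / (1 + step * Fraction(k * (k - 1), 2))
    return [lam * (w + (k - 1 - j) * step) for j, w in enumerate(weights)]

def densities(weights, sizes):
    """Per-element probabilities of a histogram with the given interval sizes."""
    return [w / s for w, s in zip(weights, sizes)]

# Intervals of sizes 1, 2, 4 (so c = 1); the input violates monotonicity
# between the first two intervals (densities 1/10 < 3/20).
sizes = [1, 2, 4]
d = [Fraction(1, 10), Fraction(3, 10), Fraction(6, 10)]
corrected = correct_histogram(d, c=1, eps=Fraction(1, 10))
```

On this input the corrected weights are $7/19$, $6/19$, $6/19$, whose per-element densities $7/19 \ge 3/19 \ge 3/38$ are non-increasing. The correction never inspects where the violation occurs, which is what makes the scheme oblivious.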


Remark 5.4. The above scheme can be easily adapted to the case where the ratio between
consecutive intervals is not always the same, but is instead $|I_{j+1}|/|I_j| = 1+c_j$ for some known $c_j \in [C_1, C_2]$;
the result then depends on the ratio $C_2/C_1 = O(1)$ as well.

As a direct corollary, this describes how to correct distributions which are promised to be (very)
close to monotone, in a completely oblivious fashion: that is, the behavior of the corrector does
not depend on what the input distribution is; furthermore, the probability of failure is null (i.e.,
$\delta = 0$).

Corollary 5.5 (Oblivious correcting of monotonicity). For every $\varepsilon' \in (0,1)$, there exists an
(oblivious) sampling corrector for monotonicity, with parameters $\varepsilon = O(\varepsilon'^3/\log^2 n)$, $\varepsilon_1 = \varepsilon'$ and sample
complexity $O(1)$.

Proof. We will apply Lemma 5.2 for $k = O(\log n/\varepsilon')$ and $\mathcal{I}$ being the corresponding Birgé
decomposition (with parameter $\varepsilon'/2$). The idea is then to work with the “flattening” $\bar{D}$ of $D$: since $D$ is
$\varepsilon$-close to monotone, it is also $(\varepsilon'/2)$-close, and $\bar{D}$ is both $(\varepsilon'/2)$-close to $D$ and $\varepsilon$-close to monotone.
Applying the correcting scheme with our value of $k$ and $c$ set to $\varepsilon'$, the corrected distribution $\tilde{D}$ is
monotone, and

\[ d_{\mathrm{TV}}(\bar{D},\tilde{D}) \le \frac{1}{2}\left(1 - \frac{1-\varepsilon(2+\varepsilon')\frac{k(k-1)}{2}}{1+\varepsilon(2+\varepsilon')\frac{k(k-1)}{2}}\right) \le \frac{\varepsilon'}{2} \]

where the last inequality derives from the fact that $k^2\varepsilon = O(\varepsilon')$. This in turn implies by a triangle
inequality that $\tilde{D}$ is $\varepsilon'$-close to $D$. Finally, observe that, as stated in the lemma, $\tilde{D}$ can be easily
simulated given access to $D$, using either 0 or 1 draw: indeed, $\tilde{D}$ is a mixture with known weights
of an explicit distribution and $\bar{D}$, and access to the latter can be obtained from $D$. □


Remark 5.6. An interesting feature of the above construction is that it not only yields an
$O(1)$-query corrector from sampling to sampling: it similarly implies a corrector from ORACLE to
ORACLE with query complexity $O(1)$, for ORACLE being (for instance) an evaluation or Cumulative
Dual oracle (cf. Appendix B). This follows from the fact that the corrected distribution $\tilde{D}$ is of
the form $\tilde{D} = \lambda D + (1-\lambda)P$, where both $\lambda$ and $P$ are fully known.


5.3 Correcting with Cumulative Dual access 

In this section we prove the following result, which shows that correcting monotonicity with $o(\log n)$
queries (in expectation) is possible when one allows a stronger type of access to the original
distribution. In particular, recall that in the Cumulative Dual model (as defined in Appendix B)
the algorithm is allowed to make, in addition to the usual draws from the distribution, evaluation
queries to its cumulative distribution function.^11

Theorem 5.7. For any $\varepsilon \in (0,1]$, any number of queries $m$ and $\varepsilon_1 = O(\varepsilon)$ as in the definition,
there exists a sampling corrector for monotonicity from Cumulative Dual to SAMP with expected
sample complexity $O\!\left(\sqrt{m\log n/\varepsilon}\right)$.

In particular, since learning distributions in the Cumulative Dual model is easily seen to have
query complexity $\Theta(\log n/\varepsilon)$ (e.g., by considering the lower bound instance of [Bir87]), the above
corrector beats the “learning approach” as long as $m = o(\log n/\varepsilon)$.


5.3.1 Overview and discussion 

A natural idea would be to first group the elements into consecutive intervals (the “buckets”),
and correct this distribution (now a histogram over these buckets) at two levels. That is, start
by correcting it optimally at a coarse level (the “superbuckets,” each of them being a group of
consecutive buckets); then, every time a sample has to be generated, draw a superbucket from
this coarse distribution and correct at a finer level inside this superbucket, before outputting a
sample from the corrected local distribution (i.e. conditional on the superbucket that was drawn
and corrected). While this approach seems tantalizing, the main difficulty with it lies in the
possible boundary violations between superbuckets: that is, even if the average weights of the
superbuckets are non-increasing, and the distribution over buckets is non-increasing inside each
superbucket, it might still be the case that there are local violations between adjacent superbuckets.
(I.e., the boundaries are bad.) A simple illustration is the sequence (.5, .1, .3, .1), where the first
“superbucket” is (.5, .1) and the second (.3, .1). The average weight is decreasing, and the sequence
is locally decreasing inside each superbucket; yet overall the sequence is not monotone.

^11 We remark that our algorithm will in fact only use this latter type of access, and will not rely on its ability to
draw samples from $D$.
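The example sequence can be checked mechanically. The following snippet (with a hypothetical helper `is_non_increasing`) evaluates the three conditions on (.5, .1, .3, .1) split into two superbuckets of two buckets each.

```python
def is_non_increasing(seq):
    """True if every element is at least as large as its successor."""
    return all(a >= b for a, b in zip(seq, seq[1:]))

buckets = [0.5, 0.1, 0.3, 0.1]
superbuckets = [buckets[0:2], buckets[2:4]]
averages = [sum(s) / len(s) for s in superbuckets]

checks = {
    "superbucket averages": is_non_increasing(averages),
    "inside each superbucket": all(is_non_increasing(s) for s in superbuckets),
    "whole sequence": is_non_increasing(buckets),
}
```

The first two checks hold while the third fails, because of the jump from .1 to .3 at the boundary between the two superbuckets.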

Thus, we have to consider 3 kinds of violations: 

(i) global superbucket violations: the average weight of the superbuckets is not monotone. 

(ii) local bucket violations: the distribution of the buckets inside some superbucket is not
monotone.

(iii) superbucket boundary violations: the probability of the last bucket of a superbucket is lower 
than the probability of the first bucket of the next superbucket. 

The ideas underlying our sampling corrector (which is granted both sampling and cumulative
query access to the distribution, as defined in the Cumulative Dual access model) are quite simple:
after reducing via standard techniques the problem to that of correcting a histogram supported
on logarithmically many intervals (the “Birgé decomposition”), we group these $\ell$ intervals in $K$
“superbuckets,” each containing $L$ consecutive intervals from that histogram (“buckets”). (As a
guiding remark, our overall goal is to output samples from a corrected distribution using $o(\ell)$
queries, as otherwise we would already use enough queries to actually learn the distribution.) This
two-level approach will allow us to keep most of the corrected distribution implicit, only figuring
out (and paying queries for that) the portions from which we will end up outputting samples.

By performing K queries, we can exactly learn the coarse distribution on superbuckets, and 
correct it for monotonicity (optimally, e.g. by a linear program ensuring the average weights 
of the superbuckets are monotone), solving the issues of type (i). In order to fix the boundary 
violations (iii) on-the-go, the idea is to allocate to each superbucket an extra budget of probability 
weight that can be used for these boundary corrections. Importantly, if this budget is not entirely 
used the sampling process restarts from the beginning with a probability corresponding with the 
remaining budget. This effectively ends up simulating a distribution where each superbucket was 
assigned an extra weight matching exactly what was needed for the correction, without having to 
figure out all these quantities beforehand (as this would cost too many queries). 

Essentially, each superbucket is selected according to its “potential weight,” that includes both 
the actual probability weight it has and the extra budget it is allowed to use for corrections. 
Whenever a superbucket $S_i$ is selected this way, we first perform optimal local corrections of type 
(ii) both on it and the previous superbucket $S_{i-1}$, making a cdf query at every boundary point 
between buckets in order to get the weights of all $2L$ buckets they contain, and then computing 
the optimal fix: at this point, the distribution is monotone inside $S_i$ (and inside $S_{i-1}$). After this, 
we turn to the possible boundary violations of type (iii) between $S_{i-1}$ and $S_i$, by “pouring” some 
of the weight from $S_i$'s budget to fill “valleys” in the last part of $S_{i-1}$. Once this water-filling has 
ended, we may not have used all of $S_i$'s budget (but as we shall see we make sure we never run 
out of it): the remaining portion is thus redistributed to the whole distribution by restarting the 
sampling process from the beginning with the corresponding probability. Note that as soon as we 
know the weights of all $2L$ Birgé buckets, no more cdf queries are needed to proceed. 

We borrow this graphic analogy with the process of pouring water from [ACCL08], which employs it in a different 
context (in order to bound the running time of an algorithm by a potential-based argument). 
5.3.2 Preliminary steps (preprocessing) 


First step: reducing $D$ to a histogram. Given cumulative dual access (i.e., granting both 
SAMP and cumulative distribution function (cdf) query access) to an unknown distribution 
$D$ over $[n]$ which is $\varepsilon$-close to monotone, we can simulate cumulative dual access to its Birgé 
flattening $D^{(1)}$, also $\varepsilon$-close to monotone and $3\varepsilon$-close to $D$ (by Corollary 2.5). 

For this reason, we hereafter work with $D^{(1)}$ instead of $D$, as it has the advantage of being 
an $\ell$-histogram for $\ell = O(\log n/\varepsilon)$. Because of this first reduction, it becomes sufficient to 
perform cdf queries on the buckets (and not the individual elements of $[n]$), which altogether 
entirely define $D^{(1)}$. 


Second step: global correcting of the superbuckets. By making $K$ cdf queries, we can figure 
out exactly the quantities $D^{(1)}(S_1), \dots, D^{(1)}(S_K)$. By running a linear program, we 
can re-weight them to obtain a distribution $D^{(2)}$ such that (a) the averages $D^{(2)}(S_j)/|S_j|$ are 
non-increasing; (b) the conditional distributions of $D^{(1)}$ and $D^{(2)}$ on each superbucket are identical 
($D^{(2)}_{S_j} = D^{(1)}_{S_j}$ for all $j \in [K]$); and (c) $\sum_{j \in [K]} |D^{(1)}(S_j) - D^{(2)}(S_j)|$ is minimized. 


Third step: allocating budgets to superbuckets. For reasons that will become clear in the 
subsequent “water-filling” step, we want to give each superbucket $S_j$ a budget $b_j$ of “extra 
weight” added to its first bucket $S_{j,1}$ that can be used for local corrections when needed; if 
it uses only part of this budget during the local correction, it will need to “give back” the 
surplus. To do so, define $D^{(3)}$ as the distribution such that 

• $D^{(3)}(S_j) = \lambda^{(3)}\big(D^{(2)}(S_j) + b_j\big)$, $j \in [K]$ (where $b_j := D^{(2)}(S_j)/(1+\varepsilon)$ for $j \in [K]$; and 
$\lambda^{(3)} := \big(1 + \sum_j b_j\big)^{-1}$ is a normalization factor). Note that $\sum_j b_j = 1/(1+\varepsilon) \in [1/2, 1]$, 
so that $(\lambda^{(3)})^{-1} \in [1, 2]$. 

• The conditional distributions on $S_j \setminus S_{j,1}$ satisfy $D^{(3)}_{S_j \setminus S_{j,1}} = D^{(2)}_{S_j \setminus S_{j,1}}$ for all $j \in [K]$. 

That is, $D^{(3)}$ is a version of $D^{(2)}$ where each superbucket is re-weighted, but “locally” looks 
the same inside each superbucket except for the first bucket of each superbucket, that received 
the additional “budget weight.” Observe that since the size $|S_j|$ of the superbuckets is 
multiplicatively increasing by a $(1+\varepsilon)$ factor (as a consequence of Birgé bucketing), the averages 
$D^{(3)}(S_j)/|S_j|$ will remain non-increasing. That is, the average changes by less for “big” values 
of $j$ than for small values, as the budget is spread over more elements. 

Remark 5.8. $D^{(3)}$ is uniquely determined by $\varepsilon$, $n$ and $D$, and can be explicitly computed using 
$K$ cdf queries. 
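
The budget allocation of the third step can be illustrated numerically. The following is a minimal sketch only, at the coarse level of superbucket weights; the function names, the toy weights and the toy sizes are assumptions made for the illustration and do not come from the paper.

```python
def allocate_budgets(d2_weights, eps):
    """Given the superbucket weights of D^(2), return (budgets, lam, d3_weights).

    budgets[j]    = b_j = D^(2)(S_j) / (1 + eps)
    lam           = normalization factor (1 + sum_j b_j)^(-1)
    d3_weights[j] = lam * (D^(2)(S_j) + b_j)
    """
    budgets = [w / (1.0 + eps) for w in d2_weights]
    lam = 1.0 / (1.0 + sum(budgets))
    d3_weights = [lam * (w + b) for w, b in zip(d2_weights, budgets)]
    return budgets, lam, d3_weights


def averages(weights, sizes):
    """Average weight per element of each superbucket."""
    return [w / s for w, s in zip(weights, sizes)]


# Toy instance (hypothetical numbers): K = 4 superbuckets whose sizes grow
# by a factor (1 + eps), with non-increasing averages under D^(2).
eps = 0.5
sizes = [16.0 * (1.0 + eps) ** j for j in range(4)]
d2 = [0.4, 0.3, 0.2, 0.1]
budgets, lam, d3 = allocate_budgets(d2, eps)
d3_averages = averages(d3, sizes)
```

In this sketch the budgets are proportional to the superbucket weights, so the superbucket averages stay non-increasing after renormalization; only the shape inside each superbucket (the first bucket) is affected by the extra weight.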


5.3.3 Sampling steps (correcting while sampling) 

Before going further, we describe a procedure that will be necessary for our fourth step, as it will 
be the core subroutine allowing us to perform local corrections between superbuckets. 


Water-filling. Partition each superbucket $S_i$ into ranges $H_i$, $M_i$ and $L_i$ where (assuming the 
buckets in $S_i$ are monotone): 

- $m_i = D^{(3)}(S_i)/|S_i|$ is the initial value of the average value of superbucket $S_i$ [this does not 
change throughout the procedure] 

- $H_i$ are the (leftmost) elements whose value is greater than $m_i$ [these elements may move to 
$M_i$ or stay in $H_i$] 

- $M_i$ are the (middle) elements whose value is equal to $m_i$ [these elements stay in $M_i$] 

- $L_i$ are the (rightmost) elements whose value is less than $m_i$ [these elements may move to $M_i$ 
or stay in $L_i$] 

- $\min_i$ is the minimum probability value in superbucket $S_i$ [this updates throughout the 
procedure] 

- $\max_i$ is the maximum probability value in superbucket $S_i$ [this updates throughout the 
procedure] 

Let $e_i := \sum_{x \in H_i} (p(x) - m_i)$ be the surplus (so that if $e_i = 0$ then $H_i = \emptyset$ and the superbucket 
is said to be dry) and $d_i := \sum_{x \in L_i} (m_i - p(x))$ be the deficit (if $d_i = 0$ then $L_i = \emptyset$ and the 
superbucket is said to be full). 

Algorithm 1 Procedure water-fill 

1: take an infinitesimal amount $dp$ from the top of the max, leftmost buckets of $H_{i+1}$, in 
superbucket $S_{i+1}$ (this would be from the first bucket and any other buckets that have the same 
probability) 

2: pour $dp$ into superbucket $S_i$ (this would land in the min, rightmost buckets of $L_i$, in superbucket 
$S_i$ and spread to the left, to buckets that have the same probability, just like water) 
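
To make the "just like water" behavior of step 2 concrete, here is a small discrete sketch of the receiving side of water-fill. It is an illustration under simplifying assumptions, not the paper's procedure: the infinitesimal pouring is replaced by a closed-form computation of the final water level, the stopping conditions of water-boundary-correction (such as the cap at $m_i$) are ignored, and the function name `pour` is hypothetical.

```python
def pour(levels, amount):
    """Pour `amount` of weight onto `levels`, filling the lowest entries first.

    Returns a new list in which the lowest entries have been raised to a
    common water level, the way water settles in a container.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")
    order = sorted(range(len(levels)), key=lambda k: levels[k])
    filled = list(levels)
    remaining = amount
    for rank, idx in enumerate(order):
        width = rank + 1                 # entries currently at the water level
        current = filled[idx]
        # Next obstacle: the level of the next-lowest entry (infinity at the top).
        nxt = levels[order[rank + 1]] if rank + 1 < len(order) else float("inf")
        need = (nxt - current) * width   # weight needed to reach that obstacle
        rise = remaining / width if need >= remaining else nxt - current
        for k in order[:width]:
            filled[k] = current + rise
        remaining -= rise * width
        if remaining <= 1e-15:
            break
    return filled
```

For non-increasing levels (as inside a locally corrected superbucket), the lowest entries are the rightmost ones, so the weight lands on the right and spreads to the left as the level rises, for instance from `[3.0, 2.0, 1.0]` to `[3.0, 2.5, 2.5]` after pouring 2 units.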


Algorithm 2 Procedure front-fill 

1: while the surplus $e_{i+1}$ is greater than the extra budget $b_{i+1}$ allocated in Section 5.3.2 do 
2: take an infinitesimal amount $dp$ from the top of the max, leftmost elements of $H_{i+1}$, in 
superbucket $S_{i+1}$ (this would be from the first bucket and any other buckets that have the 
same probability) 

3: pour $dp$ into the very first bucket of the domain, $S_{1,1}$. 

4: end while 

5: return the total amount $f_i$ of weight poured into $S_{1,1}$. 
Algorithm 3 Procedure water-boundary-correction 

Require: Superbucket index $j = i + 1$, with initial weight $D^{(3)}(S_{i+1})$. 

1: move weight from the surplus of $H_{i+1}$ into $L_i$ using water-fill until: 

(a) $\max_{i+1} \le \min_i$; or 

(b) $L_i = \emptyset$ ($S_i$ is full) - i.e. $\min_i = m_i$; or 

(c) $H_{i+1} = \emptyset$ ($S_{i+1}$ is dry) - this can only happen if $e_{i+1} < d_i$ ▷ This should never happen 
because of the “budget allocation” step. 

2: Note that the distribution might not yet be monotone on $S_i \cup S_{i+1}$, if one of the last two 
conditions is reached first. If this is the case, then do further correction: 

(a) if $L_i = \emptyset$ then do front-fill until $\max_{i+1} \le \min_i$ (this will happen before $H_{i+1} = \emptyset$) 

(b) if $H_{i+1} = \emptyset$ then abort and return FAIL ▷ This should never happen because of the 
“budget allocation” step. 

3: return the list $B_1, \dots, B_s$ of buckets in $T_i := L_i \cup S_{i+1}$, along with the weights $w_1, \dots, w_s$ they 
have after the redistribution, the portion $e_i$ of the budget that was not used, and 
the portion $f_i$ that was moved by front-fill (so that, up to the normalization factor $\lambda^{(3)}$, the redistributed weights together with $f_i$ and $e_i$ account for the weight $D^{(3)}(S_{i+1})$ of the selected superbucket). 


Sampling procedure. Recall that we now start and work with $D^{(3)}$, as obtained in Section 5.3.2. 

• Draw a superbucket $S_{i+1}$ according to the distribution $(D^{(3)}(S_1), \dots, D^{(3)}(S_K))$ on $[K]$. 

• If $S_{i+1} \neq S_1$ (we did not land in the first superbucket): 

- Obtain (via cdf queries, if they were not previously known) the $2L$ values $D^{(3)}(S_{i,j})$, 
$D^{(3)}(S_{i+1,j})$ ($j \in [L]$) of the buckets in superbuckets $S_i, S_{i+1}$. 

- Correct them (separately for each of the two superbuckets) optimally for monotonicity, 
e.g. via linear programming (if that was not done in a prior stage of sample generation), 
ignoring the extra budget $b_i$ and $b_{i+1}$ on the first bucket of each superbucket. Compute 
$H_i, M_i, L_i$ and $H_{i+1}, M_{i+1}, L_{i+1}$. 

- Call water-boundary-correction on $(i+1)$, using the extra budget only if $S_{i+1}$ becomes dry 
and not counting it while trying to satisfy condition (a). 

• If $S_{i+1} = S_1$ (we landed in the first superbucket), we proceed similarly as per the steps above, 
except for the water-boundary-correction. That is, we only correct locally for monotonicity. 

• During the execution of water-boundary-correction, the water-filling procedure may have used 
some of the initial “allocated budget” $b_{i+1}$ to pour into $L_i$. Let $e_i \in [0, b_{i+1}]$ be the amount 
of the budget remaining (not used). 

- with probability $p_i := \lambda^{(3)} e_i / D^{(3)}(S_{i+1})$, restart the sampling process from the beginning 
(this is the “budget redistribution step,” which ensures the correction only uses what it 
needs for each superbucket). 

At this point, the “new” distribution (which is at least partly implicit, as only known at a very coarse 
level over superbuckets and locally for some buckets inside $S_i \cup S_{i+1}$) obtained is monotone over the superbuckets 
(water-boundary-correction does not violate the invariant that the distribution over superbuckets is monotone), is 
monotone inside both $S_i$ and $S_{i+1}$, and furthermore is monotone over $S_i \cup S_{i+1}$. Even more important, the fact that 
$\min_i \ge \max_{i+1}$ will ensure applying the same process in the future, e.g. to $S_{i+2}$, will remain consistent with regard 
to monotonicity. 
- with probability $q_i := f_i / D^{(3)}(S_{i+1})$, where $f_i$ is the weight moved by the procedure 
front-fill, output from the very first bucket of the domain. 

- with the remaining probability, output a sample from the new (conditional) distribution 
on the buckets in $T_i := L_i \cup S_{i+1}$. This is the conditional distribution defined on $T_i$ by 
the weights $w_1, \dots, w_s$, as returned by water-boundary-correction. 

Note that the distribution we output from, if we initially select the superbucket $S_{i+1}$, is supported 
on $L_i \cup S_{i+1}$. Moreover, conditioning on $M_{i+1} \cup L_{i+1}$ we get exactly the conditional distribution 
$D^{(3)}_{M_{i+1} \cup L_{i+1}}$. (This ensures that for each bucket there is a unique superbucket that has to be 
picked initially for the bucket's weight to be modified.) Observe that as defined above, buckets 
from $L_i \subseteq S_i$ can be outputted either because superbucket $S_i$ was picked, or because $S_{i+1}$ was 
drawn and some of its weight was reassigned to $L_i$ by water-boundary-correction. The probability 
of outputting any bucket in $L_i$ is then the sum of the probabilities of the two types of events. 

5.3.4 Analysis 

The first observation is that the distribution of any sample output by the sampling process described 
above is not only consistent, but completely determined by $n$, $\varepsilon$ and $D$: 

Claim 5.9. The process described in Sections 5.3.2 and 5.3.3 uniquely defines a distribution $\tilde{D}$, 
which is a function of $D$, $n$ and $\varepsilon \in (0,1)$ only. 

Claim 5.10. The expected number of queries necessary to output $m$ samples from $\tilde{D}$ is upper 
bounded by $K + 4mL$. 

Proof. The number of queries for the preliminary stage is $K$; after this, generating a sample requires 
$X$ queries, where $X$ is a random variable satisfying 

$$X \le 2L + R X'$$

where $X, X'$ are independent and identically distributed, and $R$ is a Bernoulli random variable 
independent of $X'$ and with parameter $\Lambda$ (itself a random variable depending on $X$: $\Lambda$ takes value 
$p_i$ when the first draw selects superbucket $i+1$), corresponding to the probability of restarting the 
sampling process from the beginning. It follows that 

$$\mathbb{E}[X] \le 2L + \mathbb{E}[R]\,\mathbb{E}[X'] = 2L + \mathbb{E}[\Lambda]\,\mathbb{E}[X].$$

Using the fact that $\mathbb{E}[\Lambda] = \sum_{i \in [K]} D^{(3)}(S_{i+1})\, p_i = \lambda^{(3)} \sum_{i \in [K]} e_i \le \lambda^{(3)} \sum_{i \in [K]} b_i \in [1/3, 1/2]$ 
and rearranging, we get $\mathbb{E}[X] \le 4L$. □ 
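
The rearrangement in this proof can be sanity-checked with a short simulation. This is a sketch under simplifying assumptions that are not in the paper: every attempt is charged the worst-case $2L$ queries (no reuse of previously known bucket weights), and the restart probability is taken to be a fixed constant rather than the random parameter $\Lambda$. All names are hypothetical.

```python
import random


def expected_queries(L, restart_prob):
    """Solve E[X] = 2L + restart_prob * E[X] for E[X]."""
    return 2 * L / (1.0 - restart_prob)


def simulate_queries(L, restart_prob, rng):
    """Queries spent on one sample: 2L per attempt, restarting w.p. restart_prob."""
    queries = 2 * L
    while rng.random() < restart_prob:
        queries += 2 * L
    return queries


rng = random.Random(0)
L = 10
trials = 20000
empirical_mean = sum(simulate_queries(L, 0.5, rng) for _ in range(trials)) / trials
```

With a restart probability of at most 1/2, the closed form gives at most $4L$ queries per sample, which is the bound used in Claim 5.10.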

Lemma 5.11. If $D$ is a distribution on $[n]$ satisfying $d_{TV}(D, \mathcal{M}) \le \varepsilon$, then the distribution $\tilde{D}$ 
defined above is monotone. 

Proof. Observe that as the average weights of the superbuckets in $D^{(2)}$ are non-increasing, the 
definition of $D^{(3)}$ along with the fact that the lengths of the superbuckets are (multiplicatively) 
increasing implies that the average weights of the superbuckets in $D^{(3)}$ are also non-increasing. In 
more detail, fix $1 \le i \le K-1$; we have 

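$$\frac{D^{(2)}(S_i)}{|S_i|} \ge \frac{D^{(2)}(S_{i+1})}{|S_{i+1}|} = \frac{D^{(2)}(S_{i+1})}{(1+\varepsilon)|S_i|}$$

using the fact that $|S_j| = (1+\varepsilon)|S_{j-1}|$. From there, we get that 

$$(1+\varepsilon)\, D^{(2)}(S_i) \ge D^{(2)}(S_{i+1})$$

or equivalently 

$$\frac{b_i + D^{(2)}(S_i)}{|S_i|} = \frac{D^{(2)}(S_i) + (1+\varepsilon) D^{(2)}(S_i)}{(1+\varepsilon)|S_i|} \ge \frac{D^{(2)}(S_{i+1}) + (1+\varepsilon) D^{(2)}(S_{i+1})}{(1+\varepsilon)^2 |S_i|} = \frac{b_{i+1} + D^{(2)}(S_{i+1})}{|S_{i+1}|}$$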


showing that before renormalization (and therefore after as well) the average weights of the 
superbuckets in $D^{(3)}$ are indeed non-increasing. Rephrased, this means that the sequence of $m_i$'s, for 
$i \in [K]$, is monotone. Moreover, notice that by construction the distribution $\tilde{D}$ is monotone within 
each superbucket: indeed, it is explicitly made so one superbucket at a time, in the third step of 
the sampling procedure. After a superbucket has been made monotone this way, it can only be changed 
by water-filling, which by design can never introduce new violations: the weight is always moved 
“to the left,” with the values $m_i$ acting as boundary conditions to stop the waterfilling process 
and prevent new violations, or moved to the first element of the domain. 

It only remains to argue that monotonicity is not violated at the boundary of two consecutive 
superbuckets. But since the water-boundary-correction, if it does not abort, guarantees that the 
distribution is monotone between consecutive buckets as well (as $m_{i+1} \le \max_{i+1} \le \min_i \le m_i$), 
it is sufficient to show that water-boundary-correction never returns FAIL. This is ensured by the 
“budget allocation” step, which by providing $H_{i+1}$ with up to an additional $b_{i+1}$ to spread into 
$L_i$ guarantees it will not become dry. Indeed, if this happened then it would mean that correcting 
this particular violation (before the budget allocation, which only affects the first elements of 
the superbuckets) in $D^{(2)}$ required moving more than $b_{i+1}$ weight, contradicting the fact that 
the average weights of the superbuckets in $D^{(2)}$ were non-increasing. In more detail, the maximum 
amount of weight to “pour” in order to fill $L_i$ is in the case where $H_{i+1}$ is empty (i.e., the distribution 
on $S_{i+1}$ is already uniform) but $L_i$ is (almost) all of $S_i$ (i.e., all the weight in $S_i$ is in the first bucket). 
To correct this with our waterfilling procedure, one would have to pour $|S_i| \cdot \frac{D^{(2)}(S_{i+1})}{|S_{i+1}|} = \frac{D^{(2)}(S_{i+1})}{1+\varepsilon}$ 
weight in $L_i$, which is exactly our choice of value for $b_{i+1}$. □ 

Lemma 5.12. If $D$ is a distribution on $[n]$ satisfying $d_{TV}(D, \mathcal{M}) \le \varepsilon$, then $d_{TV}(D, \tilde{D}) = O(\varepsilon)$. 


Proof. We will bound separately the distances $D$ to $D^{(1)}$, $D^{(1)}$ to $D^{(2)}$ and $D^{(2)}$ to $\tilde{D}$, and conclude 
by the triangle inequality. 

• First of all, the distance $d_{TV}(D, D^{(1)})$ is at most $3\varepsilon$, by properties of the Birgé decomposition 
(and as $d_{TV}(D, \mathcal{M}) \le \varepsilon$). 

• We now turn to $d_{TV}(D^{(1)}, D^{(2)})$, showing that it is at most $\varepsilon$: in order to do so, we 
introduce $D'$, the piecewise-constant distribution obtained by “flattening” $D^{(1)}$ on each of the $K$ 
superbuckets (so that $D'(S_j) = D^{(1)}(S_j)$ for all $j$). It is not hard to see, e.g. by the data 
processing inequality for total variation distance, that $D'$ is also $\varepsilon$-close to monotone, and 
additionally that the closest monotone distribution $M'$ can also be assumed to be constant 
on each superbucket. 

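Consider now the transformation that re-weights in $D'$ each superbucket $S_j$ by a factor $\alpha_j > 0$ 
to obtain $M'$; it is straightforward to see from Section 5.3.2 that this transformation maps 
$D^{(1)}$ to $D^{(2)}$. Therefore, 

$$\begin{aligned}
2 d_{TV}(D^{(1)}, D^{(2)}) &= \sum_{j \in [K]} \sum_{x \in S_j} |D^{(1)}(x) - D^{(2)}(x)| = \sum_{j \in [K]} \sum_{x \in S_j} |D^{(1)}(x) - \alpha_j D^{(1)}(x)| \\
&= \sum_{j \in [K]} \sum_{x \in S_j} D^{(1)}(x) \cdot |1 - \alpha_j| = \sum_{j \in [K]} D^{(1)}(S_j) \cdot |1 - \alpha_j| \\
&= \sum_{j \in [K]} \sum_{x \in S_j} D'(x) \cdot |1 - \alpha_j| = \sum_{j \in [K]} D'(S_j) \cdot \left|1 - \frac{M'(S_j)}{D'(S_j)}\right| \\
&= \sum_{j \in [K]} |D'(S_j) - M'(S_j)| = 2 d_{TV}(D', M') \le 2\varepsilon.
\end{aligned}$$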


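• To bound $d_{TV}(D^{(2)}, \tilde{D})$, first consider the distribution $D''$ obtained by correcting optimally 
$D^{(2)}$ for monotonicity inside each superbucket separately. That is, $D''$ is the distribution 
satisfying monotonicity on each $S_j$ (separately) and $D''(S_j) = D^{(2)}(S_j)$ for each $j \in [K]$; and 
minimizing 

$$\sum_{j \in [K]} \sum_{i \in [L]} |D''(S_{j,i}) - D^{(2)}(S_{j,i})|$$

(or, equivalently, minimizing $\sum_{i \in [L]} |D''(S_{j,i}) - D^{(2)}(S_{j,i})|$ for all $j \in [K]$). The first step is 
to prove that $D''$ is close to $D^{(2)}$: recall first that by the triangle inequality, our previous 
argument implies that $D^{(2)}$ is $(2\varepsilon)$-close to monotone. Therefore, the (related) optimization 
problem asking to find a non-negative function $P$ that minimizes the same objective, but 
under the different constraints “$P$ is monotone on $[n]$ and $P([n]) = D^{(2)}([n])$” has a solution 
$P$ whose total variation distance from $D^{(2)}$ is at most $2\varepsilon$. 

But $P$ can be used to obtain $P'$, solution to the original problem, by re-weighting each 
superbucket $S_j$ the following way: 

$$P'(x) = P(x) \cdot \frac{D^{(2)}(S_j)}{P(S_j)}, \qquad x \in S_j.$$

Clearly, $P'$ satisfies the constraints of the first optimization problem; moreover, 

$$\begin{aligned}
2 d_{TV}(P', D^{(2)}) &= \sum_{j \in [K]} \sum_{x \in S_j} |P'(x) - D^{(2)}(x)| = \sum_{j \in [K]} \sum_{x \in S_j} \left| P(x) \frac{D^{(2)}(S_j)}{P(S_j)} - D^{(2)}(x) \right| \\
&\le \sum_{j \in [K]} \sum_{x \in S_j} |P(x) - D^{(2)}(x)| + \sum_{j \in [K]} \sum_{x \in S_j} P(x) \left| \frac{D^{(2)}(S_j)}{P(S_j)} - 1 \right| \\
&= 2 d_{TV}(P, D^{(2)}) + \sum_{j \in [K]} |D^{(2)}(S_j) - P(S_j)| \le 4 d_{TV}(P, D^{(2)}) \\
&\le 8\varepsilon,
\end{aligned}$$

where we used the fact that $\sum_{j \in [K]} |D^{(2)}(S_j) - P(S_j)| = \sum_{j \in [K]} \left| \sum_{x \in S_j} (D^{(2)}(x) - P(x)) \right| \le \sum_{j \in [K]} \sum_{x \in S_j} |D^{(2)}(x) - P(x)|$. 
As $d_{TV}(P', D^{(2)})$ is an upper bound on the optimal value of 
the optimization problem, we get $d_{TV}(D'', D^{(2)}) \le 4\varepsilon$. 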
The next and last step is to bound $d_{TV}(D'', \tilde{D})$, and show that it is $O(\varepsilon)$ as well. To see 
why this will allow us to conclude, note that $D''$ is the intermediate distribution that the 
sampling process we follow would define, if there was neither extra budget allocated nor 
water-boundary-correction. Put differently, $\tilde{D}$ is derived from $D''$ by adding the “right amount 
of extra budget $b'_j \in [0, b_j]$” to $S_j$, then pouring it to $S_{j-1}$ by waterfilling and front-filling; 
and normalizing afterwards by $(1 + \sum_j b'_j)^{-1}$. 

Writing $\tilde{D}''$ for the result of the transformation above before the last renormalization step, 
we can bound $d_{TV}(D'', \tilde{D})$ by 

$$\begin{aligned}
2 d_{TV}(D'', \tilde{D}) = \|D'' - \tilde{D}\|_1 &\le \|D'' - \tilde{D}''\|_1 + \|\tilde{D}'' - \tilde{D}\|_1 \\
&\le \sum_{j \in [K]} b'_j + \sum_{j \in [K]} f_j + \sum_{x \in [n]} \left| \Big(1 + \sum_{j \in [K]} b'_j\Big) \tilde{D}(x) - \tilde{D}(x) \right| \\
&= \sum_{j \in [K]} b'_j + \sum_{j \in [K]} f_j + \Big( \Big(1 + \sum_{j \in [K]} b'_j \Big) - 1 \Big) = 2 \sum_{j \in [K]} b'_j + \sum_{j \in [K]} f_j
\end{aligned}$$

where $f_j \ge 0$ is defined as the amount of weight moved from $H_j$ to the first element of 
the domain during the execution of water-boundary-correction, if front-fill is called, and the 
bound on $\|D'' - \tilde{D}''\|_1$ comes from the fact that $\tilde{D}''$ pointwise dominates $D''$, and has a total 
additional $\sum_{j \in [K]} b'_j$ weight. 

It then suffices to bound the quantities $\sum_{j \in [K]} b'_j$ and $\sum_{j \in [K]} f_j$, using for this the fact that 
by the triangle inequality $D''$ is itself $(6\varepsilon)$-close to monotone. The at most $K$ intervals 
where $D''$ violates monotonicity (which are fixed by using the $b'_j$'s) are disjoint, and centered 
at the boundaries between consecutive superbuckets: i.e., each of them is in an interval $V_j \subseteq 
L_{j-1} \cup H_j \subseteq S_{j-1} \cup S_j$. Because of this disjointness, each transformation of $D''$ into a monotone 
distribution must add weight in $V_j \cap L_{j-1}$ or subtract some from $V_j \cap H_j$ to remove the 
corresponding violation. By definition of $b'_j$ (as minimum amount of additional weight to bring 
to $L_{j-1}$ when spreading weight from $H_j$ to $L_{j-1}$), this implies that any such transformation 
has to “pay” at least $b'_j/2$ (in total variation distance) to fix violation $V_j$. From the bound on 
$d_{TV}(D'', \mathcal{M})$, we then get $\sum_{j \in [K]} b'_j \le 12\varepsilon$. A similar argument shows that $\sum_{j \in [K]} f_j \le 12\varepsilon$ 
as well, which in turn yields $d_{TV}(D'', \tilde{D}) \le 18\varepsilon$. 

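• Putting these bounds together, we obtain 

$$d_{TV}(D, \tilde{D}) \le d_{TV}(D, D^{(1)}) + d_{TV}(D^{(1)}, D^{(2)}) + d_{TV}(D^{(2)}, D'') + d_{TV}(D'', \tilde{D}) \le 3\varepsilon + \varepsilon + 4\varepsilon + 18\varepsilon = 26\varepsilon.$$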


□ 


We are finally in a position to prove the main result of the section: 

Proof of Theorem 5.7. The theorem follows from Claim 5.9, Claim 5.10, Lemma 5.11 and Lemma 5.12, 
setting $K = mL = \sqrt{m\ell}$ (where $\ell = O(\log n/\varepsilon)$ as defined in the Birgé decomposition). □ 
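
The parameter setting in this proof balances the two terms of the query bound $K + 4mL$ from Claim 5.10. The sketch below illustrates the arithmetic only; it assumes that the $\ell$ Birgé buckets are split evenly into $K$ superbuckets of $L$ buckets each (so that $KL = \ell$) and ignores rounding to integers. The function name is hypothetical.

```python
import math


def balance_parameters(m, ell):
    """Return (K, L, total) with K = m * L and K * L = ell.

    `total` is the query bound K + 4 * m * L from Claim 5.10.
    """
    K = math.sqrt(m * ell)
    L = math.sqrt(ell / m)
    return K, L, K + 4 * m * L


# Example: m = 4 samples and ell = 100 buckets.
K, L, total = balance_parameters(4, 100)
```

With this choice the bound becomes $5\sqrt{m\ell}$, which is sublinear in both $m$ and $\ell$.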
6 Constrained Error Models 


In the previous sections, no assumption was made on the form of the error, only on the amount. 
In this section, we suggest a model of errors capturing the deletion of a whole “chunk” of the 
distribution. We refer to this model as the missing data model, where we assume that some $\varepsilon$ 
probability is removed by taking out all the weight of an arbitrary interval [i,j] for 1 < i < j < n 
and redistributing it on the rest of the domain as per rejection sampling. We show that one 
can design sampling improvers for monotone distributions with arbitrarily large amounts of error. 
Hereafter, $D$ will denote the original (monotone) distribution (before the deletion error occurred), 
and $D' = D^{(i,j)}$ the resulting (faulty) one, to which the sampling improver has access. Our sampling 
improver follows what could be called the “learning-just-enough” approach: instead of attempting 
to approximate the whole unaltered original distribution, it only tries to learn the values of $i, j$; 
and then generates samples “on-the-fly.” At a high level, the algorithm works by (i) detecting the 
location of the missing interval (drawing a large (but still independent of $n$) number of samples), 
then (ii) estimating the weight of this interval under the original, unaltered distribution; and finally 
(iii) filling this gap uniformly by moving the right amount of probability weight from the end of 
the domain. To perform the first stage, we shall follow a paradigm that first appeared in [DDS14], and 
utilize testing as a subroutine to detect “when enough learning has been done.” 
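
As a concrete illustration of the missing data model, the following sketch builds the faulty distribution from a probability mass function given as a list. It is an assumption-laden toy: the representation (a Python list indexed from 1 in the mathematical sense), the function name, and the example values are all hypothetical, and the renormalization step models the "as per rejection sampling" redistribution.

```python
def delete_interval(D, i, j):
    """Zero out D on [i, j] (1-indexed, inclusive) and renormalize the rest.

    This is the distribution of a draw from D conditioned on falling
    outside [i, j], i.e. what rejection sampling would produce.
    """
    if not (1 <= i <= j <= len(D)):
        raise ValueError("need 1 <= i <= j <= n")
    removed = sum(D[i - 1:j])
    if removed >= 1.0:
        raise ValueError("cannot delete the whole distribution")
    scale = 1.0 / (1.0 - removed)
    return [0.0 if i <= x <= j else p * scale
            for x, p in enumerate(D, start=1)]


# A monotone distribution on [4] with the element 2 deleted.
D = [0.4, 0.3, 0.2, 0.1]
D_faulty = delete_interval(D, 2, 2)
```

The result is monotone on either side of the deleted interval, with a gap of zero weight in between; this is the structure the improver exploits.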

Theorem 6.1. For the class of distributions following the “missing data” error model, there ex¬ 
ists a batch sampling improver Missing-Data-Improver, that, on input e, q, 5 and a, achieves 
parameters Si = 0{e) and any 62 < e; and has sample complexity 0 ^^ 1 og|^ independent of e. 

The detailed proof of our approach, as well as the description of Missing-Data-Improver, are 
given in the next subsection. 

6.1 Proof of Theorem 6.1 

Before describing further the way to implement our 3-stage approach, we will need the following 
lemmata. The first examines the influence of adding or removing probability weight $\varepsilon$ from a 
distribution, as is the case in the missing data model: 

Lemma 6.2. Let $D$ be a distribution over $[n]$ and $\varepsilon > 0$. Suppose $D' := (1 + \varepsilon)D - \varepsilon D_1$, for some 
distribution $D_1$. Then $d_{TV}(D, D') \le \varepsilon$. 

The proof follows from a simple application of the triangle inequality to the $\ell_1$ distance between $D$ 
and $D'$. We note that the same bound applies to $D' = (1 - \varepsilon)D + \varepsilon D_1$. 

The next two lemmata show that the distance to monotonicity of distributions falling into this 
error model can be bounded in terms of the probability weight right after the missing interval. 

Lemma 6.3. Let $D$ be a monotone distribution and $D' = D^{(i,j)}$ be the faulty distribution. If 
$D'([j+1, 2j-i+1]) > \varepsilon$, then $D'$ is $\varepsilon/2$-far from monotone. 

That is, if D was the original distribution, the faulty one is formally defined as (1 ■ D — e -12^^], 

where e = D{[i,j]). 
Proof. Let $L = j - i + 1$ be the length of the interval where the deletion occurred. Since the interval 
$[j+1, 2j-i+1]$ has the same length as $[i,j]$ and weight $p > \varepsilon$, the average weight of an element is 
at least $\varepsilon/L$. Every monotone distribution $M$ should also be monotone on the interval $[i, 2j-i+1]$: 
therefore, one must have $M([i,j]) \ge M([j+1, 2j-i+1])$. Let $q := M([i,j])$. As $D'([i,j]) = 0$, 
we get that $2 d_{TV}(D', M) \ge q$. On one hand, if $q < p$ then at least $p - q$ weight must have been 
“removed” from $[j+1, 2j-i+1]$ to achieve monotonicity, and altogether $2 d_{TV}(D', M) \ge q + (p - q) = p$. 
On the other hand, if $q \ge p$ we directly get $2 d_{TV}(D', M) \ge q \ge p$. In both cases, 

$$d_{TV}(D', M) \ge p/2 > \varepsilon/2$$

and $D'$ is $\varepsilon/2$-far from monotone. □ 

Lemma 6.4. Let $D$ be a monotone distribution and $D' = D^{(i,j)}$ as above. If $D'([j+1, 2j-i+1]) < 
\varepsilon/2$, then $D'$ is $\varepsilon$-close to monotone. 

Proof. We will constructively define a monotone distribution $M$ which will be $\varepsilon$-close to $D'$. Let 
$p := D'([j+1, 2j-i+1]) < \varepsilon/2$. According to the missing data model, $D'$ should be monotone 
on the intervals $[1, i-1]$ and $[j+1, n]$. In particular, the probability weight of the last element of 
$[j+1, 2j-i+1]$ should be below the average weight of the interval, i.e. for all $k \ge 2j-i+1$ one 
has $D'(k) \le D'(2j-i+1) \le \frac{p}{j-i+1}$. 

So, if we let the distribution $M$ (that we are constructing) be uniform on the interval $[j+1, 2j-i+1]$ 
and have also total weight $p$ there, monotonicity will not be violated at the right endpoint of 
the interval; and the $\ell_1$ distance between $D'$ and $M$ in that interval will be at most $2p$. “Taking” 
another $p$ probability weight from the very end of the domain and moving it to the interval $[i,j]$ 
(where it is then uniformly spread) to finish the construction of $M$ adds at most another $2p$ to the 
$\ell_1$ distance. Therefore, $2 d_{TV}(D', M) \le 2p + 2p < 2\varepsilon$; and $M$ is monotone as claimed. □ 

The sampling improver is described in Algorithm 4. 
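
The Generating stage of Algorithm 4 (Missing-Data-Improver) can be sketched as a per-sample filter. This is an illustrative sketch, not the authors' implementation: the function name and signature are hypothetical, the preprocessing outputs $a, b, c, \gamma, \gamma'$ are assumed to be given, and the outcome of the test on Line 9 is passed in as a boolean `already_close` rather than recomputed.

```python
import random


def correct_sample(s, a, b, c, n, gamma, gamma_prime, already_close, rng):
    """Map one draw s from D' to one draw from the corrected distribution.

    a, b: estimated endpoints of the missing interval
    c: start of the tail [c, n] from which weight is taken
    gamma, gamma_prime: estimated weights of [b+1, 2b-a+1] and [c, n]
    already_close: True if preprocessing decided no correction is needed
    """
    if already_close:
        return s                                   # leave D' unchanged
    if c <= s <= n:
        # Move (about) gamma weight from the tail into the gap [a, b].
        if rng.random() < gamma / gamma_prime:
            return rng.randint(a, b)               # uniform on [a, b], inclusive
        return s
    if b + 1 <= s <= 2 * b - a + 1:
        # Flatten the interval right after the gap.
        return rng.randint(b + 1, 2 * b - a + 1)
    return s                                       # untouched part of D'


rng = random.Random(0)
```

Each output consumes exactly one draw from $D'$, so the filter can be applied on the fly to a stream of samples once preprocessing has been done.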


Implementing (i): detecting the gap 

Lemma 6.5 (Lemma (i)). There exists an algorithm that, on input a E (0,1/3) and 5 E (0,1), 
takes o(^-^ log samples from D' = and outputs either two elements a,b £ [n] or close such 
that the following holds. With probability at least 1 — d, 

• if it outputs elements a,b, then (a) [i,j] C [a,b] and (b) D'{[a,b]) < 3a^; 

• if it outputs close, then D' is a^-close to monotone. 

Proof. Inspired by techniques from [DDS14], we hrst partition the domain into t = 0(l/a^) in¬ 
tervals of roughly equal weight as follows. By taking 0^^1og|^ samples, the DKW 

inequality ensures that with probability at least 1 — 5/2 we obtain an approximation D of D', 
close up to a^/5 in Kolmogorov distance. We hereafter assume this holds. For our partitioning 
to succeed, we first have to take care of the “big elements,” which by assumption on D' (which 
originates from a monotone distribution) must all be at the beginning. In more detail, let 


def 

r = max 


X £ [n] : D{x) > 


4a^ 
Algorithm 4 Missing-Data-Improver 

Require: e, 82 < £, 6 G (0,1) and q > I, sample access to D'. 

1: Start > Preprocessing 

2: Draw m log samples from D' = 

3: Run the algorithm of Lemma 6.5 on them to get an estimate (a, h) of the unknown {i,j) or 

the value close. 

4: Run the algorithm of Lemma 6.6 on them to get an estimate 7 of Z)'([6 + 1, 26 — a + 1]), and 

values c, 7' such that | D' ([c, n]) — 7' | < . 

5: End 

6: Start i> Generating 

7: for i from 1 to do 

8: Draw s,- from D'. 

3/2 

9: if the second step of Preprocessing returned close, or 7 < then 

10: return Sj i> The distribution is already e2-close to monotone; do not change it. 

11: end if 

12: if Si S [c, n] then > Move 7 weight from the end to [a, b] 

13: With probability 7/7^ return a uniform sample from [a, 6] 

14: Otherwise, return Si 

15: else if s* E [6 + 1, 26 — a + 1] then 

16: return a uniform sample from [6 + 1,26 — a + 1] 

17: else 

18: return Sj > Do not change the part of D' that need not be changed. 

19: end if 

20: end for 

21: End 
and B {1,..., r} be the set of potentially big elements. Note that if D'{x) > a^, then necessarily 
X ^ B. This leaves us with two cases, depending on whether the “missing data interval” is amidst 
the big elements, or in the tail part of the support. 

• If [i,j] C S: it is then straightforward to exactly find i,j, and output them as a, b. Indeed all 
elements x & B have, by monotonicity, either D'{x) > D'{r) > or D'{x) = 0 (the latter 
if and only if x € [i,j])- Thus, one can distinguish between x € [i, j] (for which D{x) < a^/5) 
and X ^ [i,j] (in which case D{x) > 2a^/5). 

• If [i,j] % B: then, as r ^ [z,j] (since D'{r) > 0), it must be the case that [i,j] C S = 
{r + 1,... ,n}. Moreover, every point x ^ B is “light:” D'{x) < and D{x) < We 

iteratively define Ii,... ,It C B, where It = [r* + l,rj+i]: ri r + 1, n, and for 

rj+i min | s > r* : D{[ri + 1 , x]) > | . 

This guarantees that, for all i G [t], D'{Ii) E [a^ — C [a^ — + ^f-]. 

(And in turn that t = 0(l/a^) as claimed.) Observing that the definition of the missing 
data error model implies D' is 2-modal, we can now use the monotonicity tester of [DDS14, 
Section 3.4]. This algorithm takes only O^^log^^ samples (crucially, no dependence on n) 
to distinguish with probability at least 1 — 5 whether a /c-modal distribution is monotone 
versus e-far from it. 

We iteratively apply this tester with parameters A; = 2,e = a^/4 and 5' = 0{6/t), to each of 
the at most t prefixes of the form Pg a union bound ensures that with probability 

at least 1 — 5/2 all tests are correct. Conditioning on this, we are able to detect the first 
interval Ii* which either contains or falls after j (if no such interval is found, then the input 
distribution is already a^-close to monotone and we output close). In more detail, suppose 
hrst no run of the tester rejects (so that close is outputted). Then, by Lemma 6.3, we must 
have D{[j + 1, 2j — i + 1]) < 2 • a^/4 = a^/2, and Lemma 6.4 guarantees D' is then a^-close 
to monotone. 

Suppose now that it rejects on some prefix Pi* (and accepted for all ^ < .^*). As D' is non¬ 
increasing on [l,j], we must have [i,j] C Pi*. Moreover, the tester will by Lemma 6.3 reject 
as soon as an interval [j -t- 1 , s] C [j + \.,2j — i + 1 ] of weight 0^/2 is added to the current 
prefix. This implies, as each Ii has weight at least 0 ^/ 2 , that [i,j] C Ii*_i VJl* = [a, 6 ]. 

Finally, observe that the above can be performed with ^ • logt^ = log sam¬ 

ples, as claimed (where the first Ijo? factor comes from doing rejection sampling to run the 
tester with domain Pi only, which by construction is guaranteed to have weight n(l/a^)). 
The overall probability of failure is at most 5/2 -|- 5/2 = 5, as claimed. 

□ 

Implementing (ii): estimating the missing weight Conditioning on the output a,b of 
Lemma 6.5 being correct, the next lemma explains how to get a good estimate of the total weight 
we should put back in [a, b] in order to fix the deletion error. 

Lemma 6 . 6 . Given D', a as above, 5 E (0,1) and a,b such that [i,j] C [a,b] and D'{[a,b]) < 3a^, 
there exists an algorithm which takes O^^log^^ samples from D' and outputs values 7 , 7 ' and c 
such that the following holds with probability at least 1 — 5: 
(i) \D'{[b + 1, 26 - a + 1]) - 7 I < ; 

(ii) \D'{\c,n]) — 7 '! < and 7 ' > 7 ; 

(in) D'{[c, n]) > D'{[b + 1,26 — a + 1]) — 2a^ and D'{[c + 1, re]) < D'{[b + 1, 26 — a + 1]) + 2a^; 

(iv) 7 < 2 e + 4a^. 

Proof. Again by invoking the DKW inequality, we can obtain (with probability at least 1 — 5) 
an approximation D of D', close up to a^/2 in Kolmogorov distance. This provides us with an 
estimate 7 of Il '([6 + 1 , 26 — a + Ij) satisfying the first item (as, for any interval [r, s], D{[r, sj) is 
within an additive a^/2 of D'([r, s])). Then, setting 

$$c := \max\{\, x \in [n] : \hat{D}([x, n]) \ge \gamma \,\}$$

and $\gamma' := \hat{D}([c, n])$, items (ii) and (iii) follow. The last bound of (iv) derives from an argument 

identical as of Lemma 6.3 and the promise that D' is e-close to monotone: indeed, one must then 
have D'{[b+ 1,2b — a + 1]) < D'{[a, 6 ]) + 2e < 2e -t- 3a^, which with (i) concludes the argument. □ 
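
The estimation step in this proof can be sketched with an empirical distribution built from samples. The code below is a simplified illustration under stated assumptions: it uses a plain empirical histogram as the approximation of $D'$, does not choose the number of samples (which the lemma fixes via the DKW inequality), and recomputes tail weights naively rather than efficiently. The function name is hypothetical.

```python
from collections import Counter


def estimate_gamma_and_c(samples, n, a, b):
    """Return (gamma, gamma_prime, c) computed from the empirical distribution.

    gamma       = empirical weight of [b+1, 2b-a+1]
    c           = largest x whose empirical tail weight on [x, n] is >= gamma
    gamma_prime = empirical weight of [c, n]
    """
    counts = Counter(samples)
    m = len(samples)

    def weight(lo, hi):
        # Empirical weight of [lo, hi], clipped to the domain [1, n].
        return sum(counts[x] for x in range(max(lo, 1), min(hi, n) + 1)) / m

    gamma = weight(b + 1, 2 * b - a + 1)
    # weight(1, n) = 1 >= gamma, so the set below is never empty.
    c = max(x for x in range(1, n + 1) if weight(x, n) >= gamma)
    gamma_prime = weight(c, n)
    return gamma, gamma_prime, c


# Toy sample over [8] with estimated gap [a, b] = [2, 3].
samples = [1, 1, 1, 1, 4, 4, 5, 6, 7, 8]
gamma, gamma_prime, c = estimate_gamma_and_c(samples, 8, 2, 3)
```

By construction the returned values satisfy $\gamma' \ge \gamma$, matching the second part of item (ii).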

To finish the proof of Theorem 6.1, we apply the above lemmata with a 0(^/^); and need 
to show that the algorithm generates samples from a distribution that is S 2 = 0 (q;^)- close to 
monotone. This is done by bounding the error encountered (due to approximation errors) in the 
following parts of the algorithm: when estimating the weight 7 of an interval of equal length 
adjacent to the interval [a, 6 ], uniformizing its weight on that interval, and estimating the last 7 - 
quantile of the distribution, in order to move the weight needed to fill the gap from there. If we 
could have perfect estimates of the gap ([a, 6 ] = [i,j\), the missing weight 7 and the point c such 
that D'{\c, re]) = 7 , the corrected distribution would be monotone, as the probability mass function 
in both the gap and the next interval would be at the same “level” (that is, ^^ 7 + 1 )- 

By choice of rre, with probability at least 1 — 5 the two subroutines of the Preprocessing stage 
(from Lemma 6.5 and Lemma 6 . 6 ) behave as expected. We hereafter condition on this being the 
case. For convenience, we write I = [a, 6], J = [6 -|- 1, 26 — a -|- 1] and K = [c, re], where a, 6, c and 
7 , 7 ' are the outcome of the preprocessing phase. 


If the test in Line 9 passes. If the preprocessing stage returned either close, or a value 7 < 
562 ^^ = 5q;^, then we claim that D' is already 0(Q:^)-close to monotone. The first case is by 
correctness of Lemma 6.5; as for the second, observe that it implies D'{J) < 60 ;^. Thus, “putting 
back” (from the tail of the support) weight at most 6 a^ in [i,j] would be sufficient to correct the 
violation of monotonicity; which yields an 0{o?) upperbound on the distance of D' to monotone. 


Otherwise. This implies in particular that 7 > 5a^, and thus D'{J) > 4a^. By Lemma 6.6 (iii), 
it is then also the case that D'{K) > 2a^. Then, denoting by D the corrected distribution, we have 


D{x) = < 


D'{x) + ^ 

D'iJ) 


D'{K) 

~w~ 


D'{x)-{l-fL) 

D'{x) 


if X E / 
if X E J 
if X E AT 
otherwise. 
Distance to D'. From the expression above, we get that 

2Atw(d,D'') < ^D'{K) + 2D'{J) + ^D'{K) = 2(^^^D'{K) + D'{J)^ . 

From Lemma 6 . 6 , we also know that D'{J) < 7 + D'{K) < 7 ^ + 0 ^ and 7 / 7 ' < 1, so that 

dTv(-^) < ^( 7 ^ + a^) + 7 + < 2(7 + a^) < 4e + lOa^ = 0(e). 

(Where, for the last inequality, we used Lemma 6.6 (iv); and finally the fact that £2 < e)- 


Distance to monotone. 


Consider the distributions M defined as 

fD'(x) + ^ 


D'(J) 

M(x) = 


[ffix) 


if X S / 
if X S J 

if X E iL 
otherwise. 


We first claim that M is 0(a^)-close to monotone. Indeed, M is monotone on [a, n] by construction 
(and as D' was monotone on [ 6 , n]). The only possible violations of monotonicity are on [1,6], due 
to the approximation of (i, j) by (a, 6 ) - that is, it is possible for the interval [a, i] to now have too 
much weight, with M(a — 1) < M(a). But as we have D'([a, 6 ]) < 3a^, the total extra weight of 
this “violating bump” is O(a^). 

Moreover, the distance between M and D can be upperbounded by their difference on J and 
K: 


2(1'-£\ {JD , < 


D'(J) - ^D'(iL) 
7 


< Sa’lLiVa < 6„3 


1 - 




where we used the fact that ^ E 


■ D'{J)-o? D'[J)+o? 

D'{K)+a-^ ’ D'{K)-a^ 


inequality, D is then itself 0(a^)-close to monotone. T 


, and that D'(K) > 2q^. By the triangle 
lis concludes the proof of Theorem 6.1. □ 


7 Focusing on randomness scarcity 

7.1 Correcting uniformity 

In order to illustrate the challenges and main aspects of this section, we shall focus on what is
arguably the most natural property of interest, “being uniform” (i.e. $\mathcal{P} = \{\mathcal{U}_n\}$). As a first
observation, we note that when one is interested in correcting uniformity on an arbitrary domain
$\Omega$, allowing arbitrary amounts of additional randomness makes the task almost trivial: by using
roughly $\log|\Omega|$ random bits per query, it is possible to interpolate arbitrarily between $D$ and the
uniform distribution. One can naturally ask whether the same can be achieved while using no
(or very little) additional randomness besides the draws from the sampling oracle itself. As we
show below, this is possible, at the price of a slightly worse query complexity. We hereafter focus
once again on the case $\Omega = [n]$, and give constructions which achieve different trade-offs between
the level of correction (of $D$ to uniform), the fidelity to the original data (closeness to $D$) and
the sample complexity. We then show how to combine these constructions to achieve reasonable
performance in terms of all the above parameters. In Section 7.1.1, we turn to the related problem
of correcting uniformity on an (unknown) subgroup of the domain, and extend our results to this
setting. Finally, we discuss the differences and relations with extractors in Section 7.2.


High-level ideas The first algorithm we describe (Theorem 7.1) is a sampling corrector based
on a “von Neumann-type” approach: by seeing very crudely the distribution $D$ as a distribution
over two points (the first and second half of the support $[n]$), one can leverage the closeness of $D$
to uniform to obtain with overwhelming probability a sequence of uniform random bits, and use
them to generate a uniform element of $[n]$. The drawback of this approach lies in the number of
samples required from $D$: namely, $O(\log n)$.

The second approach we consider relies on viewing $[n]$ as the Abelian group $\mathbb{Z}_n$ and leverages
crucial properties of the convolution of distributions. Using a robust version of the fact that the
uniform distribution is the absorbing element for this operation, we are able to argue that taking a
constant number of samples from $D$ and outputting their sum obeys a distribution $\tilde{D}$ exponentially
closer to uniform (Theorem 7.2). This result, however efficient in terms of getting closer to uniform,
does not guarantee anything non-trivial about the distance of $\tilde{D}$ to the input distribution $D$. More
precisely, starting from $D$ which is at a distance $\varepsilon$ from uniform, it is possible to end up with $\tilde{D}$ at
a distance $\varepsilon'$ from uniform, but $\varepsilon + \Omega(\varepsilon^2)$ from $D$ (see Claim A.4 for more details). In other terms,
this improver does get us closer to uniform, but can overshoot in the process, getting too
far from the input distribution.

The third improver we describe (in Theorem 7.3) yields slightly different parameters: it
essentially enables one to get “midway” between $D$ and the uniform distribution, and to sample from a
distribution $\tilde{D}$ (almost) $(\varepsilon/2)$-close to both the input and the uniform distributions. It does so
by combining both previous ideas: using $D$ to generate a (roughly) unbiased coin toss, and deciding
based on the outcome whether to output a sample from $D$ or from the improver of Theorem 7.2.

Finally, by “bootstrapping” the hybrid approach described above, one can provide sampling
access to an improved $\tilde{D}$ both arbitrarily close to uniform and (almost) optimally close to the
original distribution $D$ (up to an additive $O(\varepsilon^3)$), as described in Theorem 7.4. Note that this is
at the price of an extra $\log(1/\varepsilon_2)$ factor in the sample complexity, compared to Theorem 7.2: in a
sense, the price of “staying faithful to the input data.”


Theorem 7.1 (von Neumann Sampling Corrector). For any $\varepsilon < 0.49$ (and $\varepsilon_1 = \varepsilon$) as in the
definition, there exists a sampling corrector for uniformity with query complexity
$O(\log n(\log\log n + \log(1/\delta)))$ (where $\delta$ is the probability of failure per sample).


Proof. Let $D$ be a distribution over $[n]$ such that $d_{\rm TV}(D,\mathcal{U}) \le \varepsilon \le 1/2 - c$ for some absolute
constant $c < 1/2$ (e.g., $c = 0.01$), and let $S_0, S_1$ denote respectively the sets $\{1,\dots,n/2\}$ and
$\{n/2+1,\dots,n\}$. The high-level idea is to see a draw from $D$ as a (biased) coin toss, depending on
whether the sample lands in $S_0$ or $S_1$; by applying von Neumann's method, we can then retrieve a
truly uniform bit at a time (with high probability). Repeating this $\log n$ times will yield a uniform
draw from $[n]$. More precisely, it is immediate by definition of the total variation distance that
$|D(S_0) - D(S_1)| \le 2\varepsilon$, so in particular (setting $p \stackrel{\rm def}{=} D(S_0)$) we have access to a Bernoulli random
variable with parameter $p \in \left[\frac{1}{2}-\varepsilon, \frac{1}{2}+\varepsilon\right] \subseteq [c, 1-c]$.

To generate one uniform random bit (with probability of failure at most $\delta' = \delta/\log n$), it is
sufficient to take in the worst case $m \stackrel{\rm def}{=} \left(\log\frac{1}{1-c}\right)^{-1}\log\frac{2}{\delta'}$ samples, and stop as soon as a sequence
$S_0S_1$ or $S_1S_0$ is seen (giving respectively a bit 0 or 1). If it does not happen, then the corrector
VN-Improver$_n$ outputs FAIL; the probability of failure is therefore

$$\Pr[\text{VN-Improver}_n \text{ outputs FAIL}] = p^m + (1-p)^m \le 2\cdot(1-c)^m \le \delta' = \frac{\delta}{\log n}.$$

By a union bound over the $\log n$ bits to extract, VN-Improver$_n$ indeed outputs a uniform random
number $s \in [n]$ with probability at least $1-\delta$, using at most $m\log n = O\left(\log n \log\frac{\log n}{\delta}\right)$ samples
and, in expectation, only $O((\log n)/p) = O(\log n)$. □
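
The extraction step in this proof can be sketched in code. The following is a minimal illustration, not the paper's VN-Improver itself: it assumes $n$ is a power of two, uses the classical von Neumann scheme on disjoint pairs of samples (so the per-bit budget is counted in pairs rather than in the $m$ of the proof), and the names `vn_uniform_sample`, `sample_d`, `delta` and `c` are hypothetical.

```python
import math
import random

def vn_uniform_sample(sample_d, n, delta=1e-3, c=0.01):
    """Von Neumann-style extraction of a uniform element of {1, ..., n}.

    sample_d: zero-argument callable returning one draw from D over {1, ..., n}.
    n: domain size, assumed here to be a power of two.
    Returns None (FAIL) if some bit cannot be extracted within its budget.
    """
    num_bits = n.bit_length() - 1           # log2(n) when n is a power of two
    if num_bits == 0:
        return 1
    delta_bit = delta / num_bits            # per-bit failure budget (union bound)
    # A disjoint pair is inconclusive with probability p^2 + (1-p)^2 <= 1 - 2c(1-c)
    # whenever p lies in [c, 1-c].
    pair_fail = 1 - 2 * c * (1 - c)
    max_pairs = math.ceil(math.log(1 / delta_bit) / math.log(1 / pair_fail))
    value = 0
    for _ in range(num_bits):
        bit = None
        for _ in range(max_pairs):
            first_in_s1 = sample_d() > n // 2
            second_in_s1 = sample_d() > n // 2
            if first_in_s1 != second_in_s1:
                bit = int(first_in_s1)      # S0 S1 gives 0, S1 S0 gives 1
                break
        if bit is None:
            return None                     # FAIL
        value = (value << 1) | bit
    return value + 1
```

Because the two orders $S_0S_1$ and $S_1S_0$ of a disjoint pair are equally likely whatever the bias $p$, each extracted bit is exactly unbiased; no randomness other than the draws from `sample_d` is consumed.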


As previously mentioned, we hereafter work modulo $n$, equating $[n]$ to the Abelian group $(\mathbb{Z}_n, +)$.
This convenient (and equivalent) view will allow us to use properties of convolutions of
distributions over Abelian groups,* in particular the fact that the uniform distribution on $\mathbb{Z}_n$ is (roughly
speaking) an attractive fixed point for this operation. In particular, taking $D$ to be the (unknown)
distribution promised to be $\varepsilon$-close to uniform, Fact A.3 guarantees that by drawing two
independent samples $x, y \sim D$ and computing $z = x + y \bmod n$, the distribution of $z$ is $(2\varepsilon^2)$-close to the
uniform distribution on $\{0,\dots,n-1\}$. This key observation is the basis for our next result:

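Theorem 7.2 (Convolution Improver). For any $\varepsilon < \frac{1}{2}$, $\varepsilon_2$ and $\varepsilon_1 = \varepsilon + \varepsilon_2$ as in the definition,
there exists a sampling improver for uniformity with query complexity $O\left(\frac{\log(1/\varepsilon_2)}{\log(1/\varepsilon)}\right)$.

Proof. Extending by induction the observation above to a sum of finitely many independent
samples, we get that by drawing $k \stackrel{\rm def}{=} \frac{\log\frac{1}{\varepsilon_2} - 1}{\log\frac{1}{\varepsilon} - 1}$ independent elements $s_1,\dots,s_k$ from $D$ and computing

$$s = \left(\sum_{\ell=1}^{k} s_\ell \bmod n\right) + 1 \in [n]$$

the distribution $\tilde{D}$ of $s$ is $\left(\frac{1}{2}(2\varepsilon)^k\right)$-close to uniform; and by choice of $k$, $\frac{1}{2}(2\varepsilon)^k = \varepsilon_2$. As
$d_{\rm TV}(D,\tilde{D}) \le d_{\rm TV}(D,\mathcal{U}) + d_{\rm TV}(\mathcal{U},\tilde{D}) \le \varepsilon + \varepsilon_2$, the vacuous bound on the distance between $D$
and $\tilde{D}$ is as stated. □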
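
As a sanity check on the bound $\frac{1}{2}(2\varepsilon)^k$, one can compute the $k$-fold convolution exactly on a small cyclic group. The sketch below is only an illustration under stated assumptions: distributions are plain lists indexed by the residues $0,\dots,n-1$ of $\mathbb{Z}_n$, and the helper names `tv`, `convolve_mod_n` and `k_fold` are hypothetical.

```python
def tv(p, q):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def convolve_mod_n(p, q):
    """Distribution of (x + y) mod n for independent x ~ p, y ~ q on Z_n."""
    n = len(p)
    out = [0.0] * n
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[(i + j) % n] += p_i * q_j
    return out

def k_fold(p, k):
    """k-fold convolution p * ... * p (k >= 1)."""
    out = p
    for _ in range(k - 1):
        out = convolve_mod_n(out, p)
    return out
```

For instance, taking `n = 8` and a distribution that puts weight `(1 + 2*eps)/n` on each element of the first half and `(1 - 2*eps)/n` on each element of the second half gives a distance exactly `eps` from uniform, and `tv(k_fold(d, k), uniform)` can then be compared with `0.5 * (2 * eps) ** k`.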


This triggers a natural question: namely, can this “vacuous bound” be improved? That is,
setting $\varepsilon \stackrel{\rm def}{=} d_{\rm TV}(D,\mathcal{U})$ and $D^{(k)} \stackrel{\rm def}{=} D * \cdots * D$ ($k$-fold convolution), what can be said about
$d_{\rm TV}(D, D^{(k)})$ as a function of $\varepsilon$ and $k$? Trivially, the triangle inequality asserts that

$$\varepsilon - 2^{k-1}\varepsilon^k \le d_{\rm TV}(D, D^{(k)}) \le \varepsilon + 2^{k-1}\varepsilon^k;$$

but can the right-hand side be tightened further? For instance, one might hope to achieve $\varepsilon$.
Unfortunately, this is not the case: even for $k = 2$, one cannot get better than $\varepsilon + \Omega(\varepsilon^2)$ as an
upper bound. Indeed, one can show that for $\varepsilon \in (0, \frac{1}{2})$, there exists a distribution $D$ on $\mathbb{Z}_n$ such
that $d_{\rm TV}(D,\mathcal{U}) = \varepsilon$, yet $d_{\rm TV}(D, D * D) = \varepsilon + \frac{3}{4}\varepsilon^2 + O(\varepsilon^3)$ (see Claim A.4 in the appendix).

*For more detail on this topic, the reader is referred to Appendix A.
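Theorem 7.3 (Hybrid Improver). For any $\varepsilon \le \frac{1}{2}$, $\varepsilon_1 = \frac{\varepsilon}{2} + 2\varepsilon^3 + \varepsilon'$ and $\varepsilon_2 = \frac{\varepsilon}{2} + \varepsilon'$, there exists
a sampling improver for uniformity with query complexity $O\left(\frac{\log(1/\varepsilon')}{\log(1/\varepsilon)}\right)$.

Proof. Let $D$ be a distribution over $[n]$ such that $d_{\rm TV}(D,\mathcal{U}) = \varepsilon$, and write $d_0$ (resp. $d_1$) for
$D(\{1,\dots,n/2\})$ (resp. $D(\{n/2+1,\dots,n\})$). By definition, $|d_0 - d_1| \le 2\varepsilon$. Define the Bernoulli
random variable $X$ by taking two independent samples $s_1, s_2$ from $D$, and setting $X$ to 0 if both
land in the same half of the support (both in $\{1,\dots,n/2\}$, or both in $\{n/2+1,\dots,n\}$). It follows
that $p_0 \stackrel{\rm def}{=} \Pr[X = 0] = d_0^2 + d_1^2$ and $p_1 \stackrel{\rm def}{=} \Pr[X = 1] = 2d_0d_1$, i.e. $0 \le p_0 - p_1 = (d_0 - d_1)^2 \le 4\varepsilon^2$.
In other terms, $X \sim \mathrm{Bern}(p_0)$ with $\frac{1}{2} \le p_0 \le \frac{1}{2} + 2\varepsilon^2$.

Consider now the distribution

$$\tilde{D} \stackrel{\rm def}{=} (1-p_0)D + p_0 D^{(k)}$$

where $D^{(k)} = D * \cdots * D$ ($k$ times) as in Theorem 7.2. Observe that getting a sample from $\tilde{D}$ only requires
at most $k + 2$ queries* to the oracle for $D$. Moreover,

$$d_{\rm TV}(\tilde{D},\mathcal{U}) \le (1-p_0)\,d_{\rm TV}(D,\mathcal{U}) + p_0\,d_{\rm TV}(D^{(k)},\mathcal{U}) \le (1-p_0)\varepsilon + p_0 2^{k-1}\varepsilon^k \le \frac{\varepsilon}{2} + \left(\frac{1}{4}+\varepsilon^2\right)(2\varepsilon)^k \le \frac{\varepsilon}{2} + \frac{1}{2}(2\varepsilon)^k$$

while

$$d_{\rm TV}(\tilde{D}, D) \le p_0\,d_{\rm TV}(D^{(k)}, D) \le p_0\left(\varepsilon + 2^{k-1}\varepsilon^k\right) \le \left(\frac{1}{2}+2\varepsilon^2\right)\left(\varepsilon + 2^{k-1}\varepsilon^k\right) \le \frac{\varepsilon}{2} + 2\varepsilon^3 + \frac{1}{2}(2\varepsilon)^k$$

(recalling for the rightmost step of each inequality that $\varepsilon \le \frac{1}{2}$). Taking $k = 3$, one obtains, with a
sample complexity at most 5, a distribution $\tilde{D}$ satisfying

$$d_{\rm TV}(\tilde{D},\mathcal{U}) \le \frac{\varepsilon}{2} + 4\varepsilon^3, \qquad d_{\rm TV}(\tilde{D}, D) \le \frac{\varepsilon}{2} + 6\varepsilon^3.$$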

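(Note that assuming $\varepsilon < 1/4$, one can get the more convenient, yet looser, bounds
$d_{\rm TV}(\tilde{D},\mathcal{U}) \le \frac{3}{4}\varepsilon$ and $d_{\rm TV}(\tilde{D}, D) \le \frac{7}{8}\varepsilon$.) □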
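
The sampling procedure behind this proof is short enough to write out. This is a hedged sketch rather than the authors' pseudocode: the function name `hybrid_sample` and the callable `sample_d` are hypothetical, the domain is $\{1,\dots,n\}$ with $n$ even, and the sum is reduced modulo $n$ and shifted by one as in the proof of Theorem 7.2.

```python
import random

def hybrid_sample(sample_d, n, k=3):
    """One draw from (1 - p0) * D + p0 * D^(k), using only draws from D.

    sample_d: zero-argument callable returning one draw from D over {1, ..., n}.
    """
    half = n // 2
    s1, s2 = sample_d(), sample_d()
    same_half = (s1 <= half) == (s2 <= half)   # the event X = 0, probability p0
    if not same_half:
        # X = 1 (probability 1 - p0): output a fresh sample from D, 3 queries in total.
        return sample_d()
    # X = 0 (probability p0): output a sample from the k-fold convolution, k + 2 queries.
    total = sum(sample_d() for _ in range(k))
    return total % n + 1
```

The coin $X$ is built from two draws of $D$ itself, which is why the procedure needs no randomness of its own.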

Theorem 7.4 (Bootstrapping Improver). For any $\varepsilon \le \frac{1}{2}$, $0 < \varepsilon_2 < \varepsilon$ and $\varepsilon_1 = \varepsilon - \varepsilon_2 + O(\varepsilon^3)$,
there exists a sampling improver for uniformity with query complexity $O\left(\frac{\log^2(1/\varepsilon_2)}{\log(1/\varepsilon)}\right)$.

Proof. We show how to obtain such a guarantee; note that the constant 27 in the $O(\varepsilon^3)$ is not
tight, and can be reduced at the price of a more cumbersome analysis. Let $\alpha > 0$ be a parameter
(to be determined later) satisfying $\alpha < \varepsilon_2$, and $k$ be the number of bootstrapping steps, i.e., the
number of times one recursively applies the construction of Theorem 7.3 with $\varepsilon' = \alpha$. We write $D_j$ for
the distribution obtained after the $j$-th recursive step, so that $D_0 = D$ and $\tilde{D} = D_k$; and let $u_j$
(resp. $v_j$) denote an upper bound on $d_{\rm TV}(D_j,\mathcal{U})$ (resp. $d_{\rm TV}(D_j, D)$). Note that by the guarantee
of Theorem 7.3 and applying a triangle inequality for $v_j$, one gets the following recurrence relations
for $(u_j)_{0\le j\le k}$ and $(v_j)_{0\le j\le k}$:

$$u_0 = \varepsilon, \qquad u_{j+1} = \frac{1}{2}u_j + \alpha$$

$$v_0 = 0, \qquad v_{j+1} = \left(\frac{1}{2}u_j + 2u_j^3 + \alpha\right) + v_j$$

*More precisely, 3 with probability $1-p_0$, and $k+2$ with probability $p_0$, for an expected number $(k-1)p_0 + 3 \approx k/2$.
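Solving this recurrence for $u_k$ gives

$$u_k = \frac{\varepsilon}{2^k} + 2\left(1 - \frac{1}{2^k}\right)\alpha \le \frac{\varepsilon}{2^k} + 2\alpha \qquad (7)$$

while one gets an upper bound on $v_k$ by writing

$$\begin{aligned}
v_k = v_k - v_0 &= \sum_{j=0}^{k-1}(v_{j+1} - v_j) = k\alpha + \frac{1}{2}\sum_{j=0}^{k-1}u_j + 2\sum_{j=0}^{k-1}u_j^3 \\
&= 2k\alpha + \left(1 - \frac{1}{2^k}\right)(\varepsilon - 2\alpha) + 2\sum_{j=0}^{k-1}u_j^3 \\
&\le \left(1 - \frac{1}{2^k}\right)\varepsilon + 2\left(k - 1 + \frac{1}{2^k}\right)\alpha + \left(3\varepsilon^3 + 16\varepsilon^2\alpha + 48\varepsilon\alpha^2 + 16k\alpha^3\right)
\end{aligned}$$

where we used the expression (7) for $u_j$. Since $\alpha < \varepsilon^2 \le \frac{1}{4}$, we can bound the rightmost terms as
$16k\alpha^3 \le k\alpha$, $48\varepsilon\alpha^2 \le 48\varepsilon^5$ and $16\varepsilon^2\alpha \le 16\varepsilon^4$, so that

$$v_k \le \left(1 - \frac{1}{2^k}\right)\varepsilon + 3k\alpha + 3\varepsilon^3 + 16\varepsilon^4 + 48\varepsilon^5 \le \left(1 - \frac{1}{2^k}\right)\varepsilon + 3k\alpha + 23\varepsilon^3 \qquad (8)$$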
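
The recurrences and the closed-form bounds (7) and (8) are easy to check numerically. The snippet below is a small sketch for that purpose only; the function name `bootstrap_bounds` is hypothetical, and the parameters are assumed to satisfy $\alpha \le \varepsilon^2 \le \frac{1}{4}$ as in the derivation above.

```python
def bootstrap_bounds(eps, alpha, k):
    """Iterate u_{j+1} = u_j/2 + alpha and v_{j+1} = v_j + u_j/2 + 2*u_j**3 + alpha.

    Starts from u_0 = eps, v_0 = 0 and returns (u_k, v_k).
    """
    u, v = eps, 0.0
    for _ in range(k):
        # Both right-hand sides are evaluated with the old value of u.
        u, v = u / 2 + alpha, v + u / 2 + 2 * u ** 3 + alpha
    return u, v
```

With `eps = 0.3`, `alpha = 0.01` and `k = 5`, the returned `u` can be compared with `eps / 2**k + 2 * (1 - 2**-k) * alpha`, and the returned `v` with `(1 - 2**-k) * eps + 3 * k * alpha + 23 * eps**3`.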


def 


It remains to choose k and a; to get Uk < £ 2 , set k =‘ log < log ^ + 1 and a ^£ 2 £‘^, 

so that ^ < (1 — e^)e 2 and 2 q; < 6 ^ 62 - Plugging these values in ( 8 ), 


Vk < f 1 — e + -k£2£‘^ + 23e^ = e — £2 + -k£2£‘^ + 24e^ < e — £2 + 27e^ 

(£ 2 <£) \ £ J 2 2 

where the last inequality comes from the fact that < i ^ ^ — 3- Therefore, we have 

dTv(7?fc,Z^) < £ 2 , ^TviDk, D) < £ — £2 + 27e^ as claimed. We turn to the number m of queries 
made along those k steps; from Theorem 7.3, this is at most 


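$$m \le \sum_{j=0}^{k-1}\frac{\log\frac{1}{\alpha} - 1}{\log\frac{1}{u_j} - 1} \le k\cdot\frac{\log\frac{1}{\alpha} - 1}{\log\frac{1}{\varepsilon} - 1} = O\left(\frac{\log^2\frac{1}{\varepsilon_2}}{\log\frac{1}{\varepsilon}}\right)$$

which concludes the proof.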


□ 


Note that in all four cases, as our improvers do not use any randomness of their own, they
always output according to the same improved distribution: that is, after fixing the parameters
$\varepsilon$, $\varepsilon_2$ and the unknown distribution $D$, then $\tilde{D}$ is uniquely determined, even across independent
calls to the improver.


7.1.1 Correcting uniformity on a subgroup 

Outline It is easy to observe that all the results above still hold when replacing $\mathbb{Z}_n$ by any finite
Abelian group $G$. Thus, a natural question to turn to is whether one can generalize these results
to the case where the unknown distribution is close to the uniform distribution on an arbitrary,
unknown, subgroup $H$ of the domain $G$.


To do so, a first observation is that if H were known, and if furthermore a constant (expected) 
fraction of the samples were to fall within it, then one could directly apply our previous results 
by conditioning samples on being in H, using rejection sampling. The results of this section show 
how to achieve this “identification” of the subgroup with only a log(l/e) overhead in the sample 
complexity. At a high level, the idea is to take a few samples, and argue that their greatest common
divisor will (with high probability) be a generator of the subgroup. 


Details Let $G$ be a finite cyclic Abelian group of order $n$, and $H \subseteq G$ a subgroup of order $m$.
We denote by $\mathcal{U}_H$ the uniform distribution on this subgroup. Moreover, for a distribution $D$ over
$G$, we write $D_H$ for the conditional distribution it induces on $H$, that is

$$\forall x \in G, \quad D_H(x) = \frac{D(x)\,\mathbb{1}_H(x)}{D(H)}$$

which is defined as long as $D$ puts non-zero weight on $H$. The following lemma shows that if $D$ is
close to $\mathcal{U}_H$, then so is $D_H$:


Lemma 7.5. Assume $d_{\rm TV}(D,\mathcal{U}_H) < 1$. Then $d_{\rm TV}(D_H,\mathcal{U}_H) \le d_{\rm TV}(D,\mathcal{U}_H)$.

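Proof. First, observe that the assumption implies $D_H$ is well-defined; indeed, as $d_{\rm TV}(D,\mathcal{U}_H) =
\sup_{S \subseteq G}(\mathcal{U}_H(S) - D(S))$, taking $S = H$ yields $1 > d_{\rm TV}(D,\mathcal{U}_H) \ge \mathcal{U}_H(H) - D(H) = 1 - D(H)$, and
thus $D(H) > 0$.

Rewriting the definition of $d_{\rm TV}(D,\mathcal{U}_H)$, one gets $d_{\rm TV}(D,\mathcal{U}_H) = \frac{1}{2}\left(\sum_{x\in H}\left|D(x) - \frac{1}{|H|}\right| + D(H^c)\right)$,
so that

$$\begin{aligned}
2d_{\rm TV}(D_H,\mathcal{U}_H) &= \sum_{x\in H}\left|D_H(x) - \frac{1}{|H|}\right| \le \sum_{x\in H}|D_H(x) - D(x)| + \sum_{x\in H}\left|D(x) - \frac{1}{|H|}\right| \\
&= \sum_{x\in H}|D_H(x) - D(x)| + 2d_{\rm TV}(D,\mathcal{U}_H) - \sum_{x\notin H}D(x) \\
&= \sum_{x\in H}D(x)\left|\frac{1}{D(H)} - 1\right| + 2d_{\rm TV}(D,\mathcal{U}_H) - (1 - D(H)) \\
&= D(H)\left|\frac{1}{D(H)} - 1\right| + 2d_{\rm TV}(D,\mathcal{U}_H) - (1 - D(H)) \\
&= |1 - D(H)| + 2d_{\rm TV}(D,\mathcal{U}_H) - (1 - D(H)) \\
&= 2d_{\rm TV}(D,\mathcal{U}_H).
\end{aligned}$$

□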
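
Lemma 7.5 can be checked numerically on a small cyclic group. The sketch below is an illustration only: it identifies $G$ with $\mathbb{Z}_n$ as lists indexed by $0,\dots,n-1$, takes $H$ to be the multiples of a divisor $h$ of $n$, and the helper names are hypothetical.

```python
def tv(p, q):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def uniform_on_subgroup(n, h):
    """Uniform distribution on H = {0, h, 2h, ...} inside Z_n (h divides n)."""
    m = n // h
    return [1.0 / m if x % h == 0 else 0.0 for x in range(n)]

def condition_on_subgroup(d, h):
    """Conditional distribution D_H induced by d on the multiples of h."""
    n = len(d)
    mass = sum(d[x] for x in range(0, n, h))
    if mass <= 0:
        raise ValueError("D puts no weight on H, so D_H is undefined")
    return [d[x] / mass if x % h == 0 else 0.0 for x in range(n)]
```

For any distribution `d` on $\mathbb{Z}_n$ with positive weight on $H$, the lemma predicts `tv(condition_on_subgroup(d, h), uniform_on_subgroup(n, h)) <= tv(d, uniform_on_subgroup(n, h))`.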


Let $D$ be a distribution on $G$ promised to be $\varepsilon$-close to the uniform distribution $\mathcal{U}_H$ on some
unknown subgroup $H$, for $\varepsilon < \frac{1}{2} - c$. For the sake of presentation, we hereafter without loss of
generality identify $G$ with $\mathbb{Z}_n$. Let $h$ be the generator of $H$ with smallest absolute value (when seen
as an integer), so that $H = \{0, h, 2h, 3h, \dots, (m-1)h\}$.

Observe that $D(H) \ge 1 - 2\varepsilon$, as $2d_{\rm TV}(D,\mathcal{U}_H) = \sum_{x\in H}\left|D(x) - \frac{1}{|H|}\right| + D(H^c)$; therefore, if $H$
were known one could efficiently simulate sample access to $D_H$ via rejection sampling, with only a
constant factor overhead (in expectation) per sample. It would then become possible, as hinted in
the foregoing discussion, to correct uniformity on $D_H$ (which is $\varepsilon$-close to $\mathcal{U}_H$ by Lemma 7.5) via
one of the previous algorithms for Abelian groups. The question remains to show how to find $H$
or, equivalently, $h$.


Algorithm 5 Algorithm Find-Generator-Subgroup

Require: $\varepsilon \in (0, \frac{1}{2} - c]$, SAMP$_D$ with $D$ $\varepsilon$-close to uniform on some subgroup $H \le \mathbb{Z}_n$
Ensure: Outputs a generator $\hat{h}$ of $H$ with probability $1 - O(\varepsilon)$

Draw $k$ independent samples $s_1,\dots,s_k$ from $D$, for $k \stackrel{\rm def}{=} O\left(\log\frac{1}{\varepsilon}\right)$

Compute $\hat{h} = \gcd(s_1,\dots,s_k)$

return $\hat{h}$


Lemma 7.6. Let $G$, $H$ be as before. There exists an algorithm (Algorithm 5) which, given $\varepsilon < 0.49$
as well as sample access to some distribution $D$ over $G$, makes $O\left(\log\frac{1}{\varepsilon}\right)$ calls to the oracle and
returns an element of $G$. Further, if $d_{\rm TV}(D,\mathcal{U}_H) \le \varepsilon$, then with probability at least $1 - O(\varepsilon)$ its
output is a generator of $H$.
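
The core computation of Algorithm 5 is a single gcd over the samples. A minimal sketch, assuming the samples are already given as integers in $\{0,\dots,n-1\}$ (the function name `find_generator` is hypothetical, and drawing the $k$ samples is left to the caller):

```python
from functools import reduce
from math import gcd

def find_generator(samples):
    """Candidate generator of H: the gcd of the drawn samples.

    samples: iterable of integers in {0, ..., n-1}, viewed as elements of Z_n.
    gcd(0, x) = x, so samples equal to 0 do not affect the result.
    """
    return reduce(gcd, samples, 0)
```

If every sample lies in $H = \{0, h, 2h, \dots\}$, the result is a multiple of $h$; it equals $h$ exactly when the quotients $s_i/h$ are relatively prime, which is the event analyzed in the proof below.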

Proof. In order to argue correctness of the algorithm, we will need the following well-known facts: 

Fact 7.7. Fix any $k \ge 1$, and let $p_{n,k}$ be the probability that $k$ independent numbers drawn
uniformly at random from $[n]$ be relatively prime. Then $p_{n,k} \xrightarrow[n\to\infty]{} \frac{1}{\zeta(k)}$ (where $\zeta$ is the Riemann zeta
function).

Fact 7.8. One has C,{x) = 1 ; and in particular = 1 ~ ^ • 

With this in hand, let k 0(^logi^, chosen so that ek = We break the 

analysis of our subgroup-finding algorithm in two cases: 

Case 1: $|H| = \Theta(1)$ This is the easy case: if $H$ only contains constantly many elements ($m$ is a
constant independent of $n$ and $\varepsilon$), then after taking $k$ samples $s_1,\dots,s_k \sim D$, we have

• $s_1,\dots,s_k \in H$ (event $E_1$) with probability at least $(1-\varepsilon)^k = 1 - O(k\varepsilon)$;

• the probability that there exists an element of $H$ not hit by any of the $s_i$'s is at most, by a
union bound,

^(l-D(x))'' < m (l - — + eY = 
x&H \ m J 

for $\varepsilon$ sufficiently small ($\varepsilon \ll \frac{1}{m}$). Let $E_2$ be the event that each element of $H$ appears amongst the
samples.

Overall, with probability $1 - O(k\varepsilon)$ (conditioning on $E_1$ and $E_2$), our set of samples is exactly $H$,
and $\gcd(s_1,\dots,s_k) = \gcd(H) = h$.
Case 2: $|H| = \omega(1)$ This amounts to saying that $h = o(n)$. In this case, taking again $k$ samples
$s_1,\dots,s_k \sim D$ and denoting by $\hat{h}$ their greatest common divisor:

• $s_1,\dots,s_k \in H$ (event $E_1$) with probability at least $(1-\varepsilon)^k = 1 - O(k\varepsilon)$ as before;

• conditioned on $E_1$, note that if the $s_j$'s were uniformly distributed in $H$, then the probability
that $\hat{h} = h$ would be exactly $p_{m,k}$, as $\gcd(ha,\dots,hb) = h$ if and only if $\gcd(a,\dots,b) = 1$,
i.e. if $a,\dots,b$ are relatively prime. In this ideal scenario, therefore, we would have

Pr[gcd(ai,..., St) = ft I El ] = ^ ^ = 1 _ o(i) 


by Fact 7.7 and our assumption $h = o(n)$.

To adapt this result to our case, where $s_1,\dots,s_k \sim D_H$ (as we conditioned on $E_1$), it is
sufficient to observe that by the Data Processing Inequality for total variation distance,


Pr [gcd(si,... ,Sfc) = h] - Pr [gcd(si,..., = h] 


< dTV 


'H ) — ks 


so that in our case 


Pr[gcd(si, ...,Sk) = h\Ei]> pi^^k-ke ■ ■ > y^-ke = l-0[^\ =1-0 --^ (9) 




log?. 


In either case, with probability at least 1 — 0(e), we find a generator h of H, acting as a 
(succinct) representation of H which allows us to perform rejection sampling. □ 

This directly implies the theorem below: any sampling improver for uniformity on a group 
directly yields an improver for uniformity on an unknown subgroup, with essentially the same 
complexity. 

Theorem 7.9. Suppose we have an $(\varepsilon,\varepsilon_1,\varepsilon_2)$-sampling improver for uniformity over Abelian finite
cyclic groups, with query complexity $q(\varepsilon,\varepsilon_1,\varepsilon_2)$. Then there exists an $(\varepsilon,\varepsilon_1,\varepsilon_2)$-sampling improver
for uniformity on subgroups, with query complexity

$$O\left(\log\frac{1}{\varepsilon} + q(\varepsilon,\varepsilon_1,\varepsilon_2)\log q(\varepsilon,\varepsilon_1,\varepsilon_2)\right).$$

Proof. The proof is straightforward: perform rejection sampling over the subgroup, once identified. Since
there is a constant probability of hitting it, by trying at most $O(\log q)$ draws per sample before outputting FAIL,
one can provide a sample from $D_H$ to the original algorithm with probability $1 - 1/(10q)$, for each
of the (at most) $q$ queries. □
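
The rejection-sampling step used in this proof can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: elements of $\mathbb{Z}_n$ are integers, membership in $H$ is tested as divisibility by the generator $h$ found earlier, and the names `sample_from_subgroup`, `sample_d` and `max_tries` are hypothetical.

```python
def sample_from_subgroup(sample_d, h, max_tries):
    """Simulate one draw from D_H by rejection sampling.

    sample_d: zero-argument callable returning one draw from D over Z_n.
    h: generator of the subgroup H = {0, h, 2h, ...}.
    max_tries: cap on the number of draws (O(log q) in the proof) before FAIL.
    Returns an element of H, or None (FAIL) if no draw landed in H.
    """
    for _ in range(max_tries):
        s = sample_d()
        if s % h == 0:          # s belongs to H
            return s
    return None
```

Each call to the underlying improver's oracle is then answered by `sample_from_subgroup` instead of `sample_d`.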


7.2 Comparison with randomness extractors 

In the randomness extractor model, one is provided with a source of imperfect random bits (and 
sometimes an additional source of completely random bits), and the goal is to output as many 
random bits as possible that are close to uniformly distributed. In the distribution corrector model, 
one is provided with a distribution that is close to having a property $\mathcal{P}$, and the goal is to have the
ability to generate a similar distribution that has property $\mathcal{P}$.
One could therefore view extractors as sampling improvers for the property of uniformity of
distributions (i.e. $\mathcal{P} = \{\mathcal{U}_n\}$): indeed, both sampling correctors and extractors attempt to minimize
the use of extra randomness. However, there are significant differences between the two settings.
A first difference is that randomness extractors assume a lower bound on the min-entropy* of the
input distribution, whereas sampling improvers assume the distribution to be $\varepsilon$-close to uniform
in total variation distance. Note that the two assumptions are not comparable.† Secondly, in both
the extractor and sampling improver models, since the entropy of the output distribution should
be larger, one would either need more random bits from the weak random source or additional
uniform random bits. Our sampling improvers do not use any extra random bits, which is also
the case in deterministic extractors, but not in other extractor constructions. However, unlike the
extractor model, in the sampling improver model, there is no bound on the number of independent
samples one can take from the original distribution. Tight bounds and impossibility results are
known for both general and deterministic extractors [Vad12, RTS00], in particular in terms of the
amount of additional randomness required. Because of the aforementioned differences in both the
assumptions on and access to the input distribution, these lower bounds do not apply to our setting,
which explains why our sampling improvers avoid this need for extra random bits.

7.3 Monotone distributions and randomness scarcity 

In this section, we describe how to utilize a (close-to-monotone) input distribution to obtain the 
uniform random samples some of our previous correctors and improvers need. This is at the price 
of an $O(\log n)$-sample overhead per draw, and follows the same general approach as in Theorem 7.1.
We observe that even if this seems to defeat the goal (as, with this many samples, one could even 
learn the distribution, as stated in Lemma 5.1), this is not actually the case: indeed, the procedure 
below is meant as a subroutine for these very same correctors, emancipating them from the need 
for truly additional randomness - which they would require otherwise, e.g. to generate samples 
from the corrected or learnt distribution. 

Lemma 7.10 (Randomness from (almost) monotone). There exists a procedure which, given
$\varepsilon \in [0,1/3)$ and $\delta > 0$, as well as sample access to a distribution $D$ guaranteed to be $\varepsilon$-close to
monotone, either returns

• “point mass,” if $D$ is $\varepsilon$-close to the point distribution‡ on the first element;

• or a uniform random sample from $[n]$;

with probability of failure at most 6. The procedure makes o(^-^ log samples from D in the first 
case, and log in the second. 

Proof. By taking 0(log(l/(5)/e^) samples, the algorithm starts by approximating by F the cdf F 
of the distribution up to an additive | in .^oo- Then, defining 

m min | f E [n] : .F(z) > 1 — ^ | 

*The min-entropy of a distribution $D$ is defined as $H_\infty(D) = \log\frac{1}{\max_x D(x)}$.

†For example, the min-entropy of a distribution which is $\varepsilon$-close to uniform can range from $\log(1/\varepsilon)$ to $\Omega(\log n)$.
Conversely, the distance to uniformity of a distribution which has high min-entropy can also vary significantly: there
exist distributions with min-entropy $\Omega(\log n)$ but which are respectively $\Omega(1)$-far from and $O(1/n)$-close to uniform.

‡The Dirac distribution $\delta_1$ defined by $\delta_1(1) = 1$, which can be trivially sampled from without the use of any
randomness by always outputting 1.
so that F{m) > 1 — According to the value of m, we consider two cases:

• If $m = 1$, then $D$ is $\varepsilon$-close to the (monotone) distribution $\delta_1$ which has all weight on the first
element; this means we have effectively learnt the distribution, and can thereafter consider $\delta_1$,
for all purposes, in lieu of $D$.

def 

• If m > 1, then -D(l) < 1 — | and we can partition the domain in two sets So = {1 ,... ,k} 
and 5i {k, ..., n}, by setting 

k min | i G [n] : F{i) < 1 — ^ 


By our previous check we know that this quantity is well-defined. Further, this implies that 
D{So) < 1 — I and D{Si) > | (the actual values being known up to ±|). D being e-close to 
monotone, it also must be the case that 


D{So) > 


k 


k + l 


1 - 


3e 


- 2e > 


1 


1 - 


3e 


1 19e 

- 2e >- 

- 2 8 


since F{k -|- 1) > 1 — § implies £>({!,..., k}) + D{k) = F>({1,..., -|- 1}) > 1 — By set 


def 


ting p = D(So), this means we have access to a Bernoulli random variable with parameter p G 
i - 3e, 1 - 


. As in the proof of Theorem 7.1 (the constant c being replaced by min^|, ^ — 3e^), 
one can then leverage this to output with probability at least 1 — 5 a, uniform random number 
s G [n] using log samples—and in expectation. □ 
8 Open questions


Correcting vs. Learning A main direction of interest would be to obtain more examples of 
properties for which correcting is strictly more efficient than (agnostic or non-agnostic) learning. 
Such examples would be insightful even if they are more efficient only in terms of the number 
of samples required from the original distribution, without considering the additional randomness 
requirements for generating the distribution. More specifically, one may ask whether there exists a 
sampling corrector for monotonicity of distributions (i.e., one that beats the learning bound from 
Lemma 5.1) for all e < 1 which uses at most o(logn) samples from the original distribution per 
sample output of the corrected distribution. 

The power of additional queries Following the line of work pursued in [CFGM13, CRS14,
CR14] (in the setting of distribution testing), it is natural in many situations to consider
additional types of queries to the input distribution: e.g., either conditional queries (getting a sample
conditioned on a specific subset of the domain) or cumulative queries (granting query access to
the cumulative distribution function, besides the usual sampling). By providing algorithms with
this extended access to the underlying probability distribution, can one obtain faster sampling
correctors for specific properties, as we did in Section 5.3 in the case of monotonicity?

Confidence boosting Suppose that there exists, for some property $\mathcal{P}$, a sampling improver $A$
that only guarantees a success probability* of 2/3. Using $A$ as a black-box, can one design a
sampling improver $A'$ which succeeds with probability $1-\delta$, for any $\delta$?

More precisely, let $A$ be a batch improver for $\mathcal{P}$ which, when queried, makes $q(\varepsilon_1,\varepsilon_2)$ queries
and provides $t \ge 1$ samples, with success probability at least 2/3. Having black-box access to $A$,
can we obtain a batch improver $A'$ which on input $\delta > 0$ provides $t' \ge 1$ samples, with success
probability at least $1-\delta$? If so, what is the best $t'$ one can achieve, and what is the minimum
query complexity of $A'$ one can get (as a function of $q(\cdot,\cdot)$, $t'$ and $\delta$)?

This is known for property testing (by running the testing algorithm independently $O(\log(1/\delta))$
times and taking the majority vote), as well as for learning (again, by running the learning algorithm
many times, and then doing hypothesis testing, e.g. à la [DK14, Theorem 19]). However, these
approaches do not straightforwardly generalize to sampling improvers or correctors, respectively
because the output is not a single bit, and because we only obtain a sequence of samples (instead of an
actual, fully-specified hypothesis distribution).


*Note that the case of interest here is of batch sampling improvers: indeed, in order to generate a single draw,
a sampling improver acts in a non-trivial way only if the parameter $\varepsilon$ is greater than its failure probability $\delta$. If not,
a draw from the original distribution already satisfies the requirements.
A On convolutions of distributions over an Abelian finite cyclic group

Definition A.1. For any two probability distributions $D_1$, $D_2$ over a finite group $G$ (not necessarily
Abelian), the convolution of $D_1$ and $D_2$, denoted $D_1 * D_2$, is the distribution on $G$ defined by

$$D_1 * D_2(x) = \sum_{g\in G} D_1(xg^{-1})D_2(g).$$

In particular, if $G$ is Abelian, $D_1 * D_2 = D_2 * D_1$.

Fact A.2. The convolution satisfies the following properties:

(i) it is associative:

$$\forall D_1, D_2, D_3, \quad D_1 * (D_2 * D_3) = (D_1 * D_2) * D_3 = D_1 * D_2 * D_3 \qquad (10)$$

(ii) it has a (unique) absorbing element, the uniform distribution $\mathcal{U}(G)$:

$$\forall D, \quad \mathcal{U}(G) * D = \mathcal{U}(G) \qquad (11)$$

(iii) it can only decrease the total variation:

$$\forall D_1, D_2, D_3, \quad d_{\rm TV}(D_1 * D_2, D_1 * D_3) \le d_{\rm TV}(D_2, D_3) \qquad (12)$$

For more on convolutions of distributions over finite groups, see for instance [Dia88] or [BOCLR08].

Fact A.3 ([Mac13b]). Let $G$ be a finite Abelian group, and $D_1$, $D_2$ two probability distributions
over $G$. Then, the convolution of $D_1$ and $D_2$ satisfies

$$d_{\rm TV}(\mathcal{U}(G), D_1 * D_2) \le 2\,d_{\rm TV}(\mathcal{U}(G), D_1)\,d_{\rm TV}(\mathcal{U}(G), D_2) \qquad (13)$$

where $\mathcal{U}(G)$ denotes the uniform distribution on $G$. Furthermore, this bound is tight.
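
Properties (11) and (12) of Fact A.2 can be observed directly on $\mathbb{Z}_n$. The following sketch only illustrates the definitions: it specializes Definition A.1 to the additive group $\mathbb{Z}_n$ (so $xg^{-1}$ becomes $x - g \bmod n$), represents distributions as lists indexed by $0,\dots,n-1$, and uses the hypothetical helper names `convolve` and `tv`.

```python
def tv(p, q):
    """Total variation distance between two distributions given as lists."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def convolve(d1, d2):
    """(D1 * D2)(x) = sum over g of D1(x - g) * D2(g), indices taken mod n."""
    n = len(d1)
    return [sum(d1[(x - g) % n] * d2[g] for g in range(n)) for x in range(n)]
```

With `u = [1/n] * n`, the list `convolve(u, d)` agrees with `u` up to floating-point error for any distribution `d`, and `tv(convolve(d1, d2), convolve(d1, d3))` never exceeds `tv(d2, d3)`.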

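Claim A.4. For $\varepsilon \in (0, \frac{1}{2})$, there exists a distribution $D$ on $\mathbb{Z}_n$ such that $d_{\rm TV}(D,\mathcal{U}) = \varepsilon$, yet
$d_{\rm TV}(D, D * D) = \varepsilon + \frac{3}{4}\varepsilon^2 + O(\varepsilon^3) > \varepsilon$.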

Proof. Inspired by, and following, a question on MathOverflow ([Mac13a]). Setting $\delta = 1 - \varepsilon >
1/2$, and taking $D_A$ to be uniform on a subset $A$ of $\mathbb{Z}_n$ with $|A| = \delta n$, one gets that $d_{\rm TV}(D_A,\mathcal{U}) = \varepsilon$,
and yet

$$d_{\rm TV}(D_A, D_A * D_A) = \frac{1}{2}\|D_A - D_A * D_A\|_1 = \frac{1}{2}\sum_{g\in\mathbb{Z}_n}\left|\frac{\mathbb{1}_A(g)}{|A|} - \frac{r(g)}{|A|^2}\right| = 1 - \frac{1}{|A|^2}\sum_{a\in A}r(a)$$

where $r(g)$ is the number of representations of $g$ as a sum of two elements of $A$ (as one can show
that $D_A * D_A(g) = |A|^{-2}r(g)$). Fix $A$ to be the interval of length $\delta n$ centered around $n/2$, that is
$A = \{\ell,\dots,L\}$ with

$$\ell \stackrel{\rm def}{=} \frac{1-\delta}{2}n, \qquad L \stackrel{\rm def}{=} \frac{1+\delta}{2}n.$$

Computing the quantity $\sum_{a\in A}r(a)$ amounts to counting the number of pairs $(a,b) \in A\times A$ whose
sum (modulo $n$) lies in $A$. For convenience, define $k = (1-\delta)n$ and $K = \delta n$:
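• for $k \le a \le K$, $a + b$ ranges from $\ell + a \le L$ to $L + a \ge \ell + n$, so that modulo $n$ exactly
$|A| - |A^c| = (2\delta - 1)n$ elements of $A$ are reached (each of them exactly once);

• for $\ell \le a < k$, $a + b$ ranges from $\ell + a \le L$ to $L + a < \ell + n$, so that the elements of $A$ not
obtained are those in the interval $\{\ell,\dots,\ell + a - 1\}$ (there are $a$ of them), and again the others
are obtained exactly once;

• for $K < a \le L$, $a + b$ ranges from $L < \ell + a \le n$ to $L + a \le K + n$, so that the elements of $A$
not obtained are those in the interval $\{L + a - n + 1,\dots,L\}$ (there are $n - a$ of them), and as
before the others are hit exactly once.

It follows that

$$\begin{aligned}
\sum_{a\in A}r(a) &= (K - k + 1)(2\delta - 1)n + \sum_{a=\ell}^{k-1}(\delta n - a) + \sum_{a=K+1}^{L}(\delta n - (n - a)) \\
&= (2\delta - 1)(2\delta - 1)n^2 + \sum_{a=\ell}^{k-1}(\delta n - a) + \sum_{a=K+1}^{L}(a - (1-\delta)n) \qquad \text{(the 2 sums are equal)} \\
&= (2\delta - 1)^2n^2 + 2\cdot\frac{n^2}{8}(7\delta - 3)(1 - \delta) + O(n) \\
&= \frac{n^2}{4}\left(4(4\delta^2 - 4\delta + 1) + (7\delta - 3)(1 - \delta)\right) + O(n) = \frac{n^2}{4}\left(9\delta^2 - 6\delta + 1\right) + O(n) \\
&= n^2\left(1 - 3\varepsilon + \frac{9}{4}\varepsilon^2\right) + O(n)
\end{aligned}$$

and thus

$$d_{\rm TV}(D_A, D_A * D_A) = 1 - \frac{1}{|A|^2}\sum_{a\in A}r(a) = 1 - \frac{1 - 3\varepsilon + \frac{9}{4}\varepsilon^2}{(1-\varepsilon)^2} + O\left(\frac{1}{n}\right) = \varepsilon + \frac{3}{4}\varepsilon^2 + O(\varepsilon^3)$$

(as $\varepsilon = \omega(1/\sqrt[3]{n})$).

□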
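
The counting argument of this proof can be confirmed by brute force for moderate $n$. The sketch below is only a numerical illustration: the function name `tv_to_self_convolution` is hypothetical, residues are indexed from 0, and $A$ is taken to be a centered interval of $\mathrm{round}((1-\varepsilon)n)$ residues.

```python
def tv_to_self_convolution(n, eps):
    """d_TV(D_A, D_A * D_A) for D_A uniform on a centered interval A of Z_n.

    Uses the identity d_TV = 1 - (1/|A|^2) * #{(a, b) in A x A : a + b mod n in A}.
    """
    size = round((1 - eps) * n)             # |A| = delta * n
    lo = (n - size) // 2                    # A = {lo, ..., lo + size - 1}
    members = range(lo, lo + size)
    in_a = [False] * n
    for a in members:
        in_a[a] = True
    hits = sum(1 for a in members for b in members if in_a[(a + b) % n])
    return 1 - hits / size ** 2
```

For `n = 400` and `eps = 0.1` the returned value is close to `eps + 0.75 * eps**2`, the gap being of order `eps**3`.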


B Formal definitions: learning and testing 

In this appendix, we define precisely the notions of testing, tolerant testing, learning and proper 
learning of distributions over a domain [n]. 

Definition B.1 (Testing). Fix any property $\mathcal{P}$ of distributions, and let $\mathrm{ORACLE}_D$ be an oracle
providing some type of access to $D$. A $q$-sample testing algorithm for $\mathcal{P}$ is a randomized algorithm
$T$ which takes as input $n$, $\varepsilon \in (0,1]$, as well as access to $\mathrm{ORACLE}_D$. After making at most $q(\varepsilon, n)$
calls to the oracle, $T$ outputs either ACCEPT or REJECT, such that the following holds:

• if $D \in \mathcal{P}$, $T$ outputs ACCEPT with probability at least 2/3;

• if $d_{\rm TV}(D, \mathcal{P}) \ge \varepsilon$, $T$ outputs REJECT with probability at least 2/3.

We shall also be interested in tolerant testers - roughly, algorithms robust to a relaxation of the
first item above:

Definition B.2 (Tolerant testing). Fix property $\mathcal{P}$ and $\mathrm{ORACLE}_D$ as above. A $q$-sample tolerant
testing algorithm for $\mathcal{P}$ is a randomized algorithm $T$ which takes as input $n$, $0 \le \varepsilon_1 < \varepsilon_2 \le 1$, as
well as access to $\mathrm{ORACLE}_D$. After making at most $q(\varepsilon_1, \varepsilon_2, n)$ calls to the oracle, $T$ outputs either
ACCEPT or REJECT, such that the following holds:

• if $d_{\rm TV}(D, \mathcal{P}) \le \varepsilon_1$, $T$ outputs ACCEPT with probability at least 2/3;

• if $d_{\rm TV}(D, \mathcal{P}) \ge \varepsilon_2$, $T$ outputs REJECT with probability at least 2/3.

Note that the above definition is quite general, and can be instantiated for different types of 
oracle access to the unknown distribution. In this work, we are mainly concerned with two such 
settings, namely the sampling oracle access and Cumulative Dual oracle access: 

Definition B.3 (Standard access model (sampling)). Let $D$ be a fixed distribution over $[n]$. A
sampling oracle for $D$ is an oracle $\mathrm{SAMP}_D$ defined as follows: when queried, $\mathrm{SAMP}_D$ returns an
element $x \in [n]$, where the probability that $x$ is returned is $D(x)$ independently of all previous calls
to the oracle.

Definition B.4 (Cumulative Dual access model [BKR04, CR14]). Let $D$ be a fixed distribution
over $[n]$. A cumulative dual oracle for $D$ is a pair of oracles $(\mathrm{SAMP}_D, \mathrm{CEVAL}_D)$ defined as follows:
the sampling oracle $\mathrm{SAMP}_D$ behaves as before, while the evaluation oracle $\mathrm{CEVAL}_D$ takes as input
a query element $j \in [n]$, and returns the value of the cumulative distribution function (cdf) at $j$.
That is, it returns the probability weight that the distribution puts on $[j]$, $D([j]) = \sum_{i=1}^{j} D(i)$.
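To make the two access models concrete, here is a minimal sketch of a cumulative dual oracle for a distribution given explicitly as a list of probabilities. The class name `CumulativeDualOracle` and its methods `samp` and `ceval` are hypothetical names chosen for illustration. In the settings considered in this paper the distribution is unknown and is only reachable through such oracles, so the explicit list is purely a simulation device.

```python
import bisect
import random
from itertools import accumulate


class CumulativeDualOracle:
    """Simulated (SAMP_D, CEVAL_D) pair for an explicit D over [n] = {1, ..., n}."""

    def __init__(self, probs, seed=None):
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")
        self.n = len(probs)
        # cdf[j - 1] = D([j]) = D(1) + ... + D(j)
        self.cdf = list(accumulate(probs))
        self.rng = random.Random(seed)

    def samp(self):
        """SAMP_D: return x in [n] with probability D(x), independently across calls."""
        u = self.rng.random()  # uniform in [0, 1)
        # Number of cumulative values <= u is the 0-based index of the sampled
        # element; min() guards against the last cdf entry being slightly
        # below 1 because of floating-point rounding.
        return min(bisect.bisect_right(self.cdf, u), self.n - 1) + 1

    def ceval(self, j):
        """CEVAL_D: return the cdf value D([j]) for a query j in [n]."""
        if not 1 <= j <= self.n:
            raise ValueError("query must lie in [n]")
        return self.cdf[j - 1]


if __name__ == "__main__":
    oracle = CumulativeDualOracle([0.5, 0.25, 0.25], seed=0)
    print([oracle.samp() for _ in range(10)])
    print(oracle.ceval(2))
```

A tester or corrector in the Cumulative Dual model would be charged one query per call to either `samp` or `ceval`.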

Another class of algorithms we consider is that of (proper) learners. We give the precise definition 
below: 

Definition B.5 (Learning). Let $\mathcal{C}$ be a class of probability distributions and $D \in \mathcal{C}$ be an unknown
distribution. Let also $\mathcal{H}$ be a hypothesis class of distributions. A $q$-sample learning algorithm for
$\mathcal{C}$ is a randomized algorithm $\mathcal{L}$ which takes as input $n$, $\varepsilon, \delta \in (0,1)$, as well as access to $\mathrm{SAMP}_D$
and outputs the description of a distribution $\hat{D} \in \mathcal{H}$ such that with probability at least $1 - \delta$ one
has $d_{\rm TV}(D, \hat{D}) \le \varepsilon$.

If in addition $\mathcal{H} \subseteq \mathcal{C}$, then we say $\mathcal{L}$ is a proper learning algorithm.

The exact formalization of what learning a probability distribution means has been considered 
in Kearns et al. [KMR+94]. We note that in their language, the variant of learning this paper is
most closely related to is learning to generate. 


C Omitted proofs 

This section contains the proofs of some of the technical lemmata of the paper, omitted for the 
sake of conciseness. 


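Proof of Eq. (2). Fix $\alpha > 0$, and let $D_1, D_2$ be two arbitrary distributions on $[n]$. Recall that
$\Phi_\alpha(D_j)$ is the Birgé flattening of distribution $D_j$ (on the decomposition with parameter $\alpha$).

$$\begin{aligned}
2 d_{\rm TV}(\Phi_\alpha(D_1), \Phi_\alpha(D_2)) &= \sum_{i=1}^{n} \left|\Phi_\alpha(D_1)(i) - \Phi_\alpha(D_2)(i)\right| = \sum_{k=1}^{\ell} \sum_{i \in I_k} \left|\frac{D_1(I_k)}{|I_k|} - \frac{D_2(I_k)}{|I_k|}\right| \\
&= \sum_{k=1}^{\ell} \left|D_1(I_k) - D_2(I_k)\right| = \sum_{k=1}^{\ell} \left|\sum_{i \in I_k} \big(D_1(i) - D_2(i)\big)\right| \\
&\le \sum_{k=1}^{\ell} \sum_{i \in I_k} \left|D_1(i) - D_2(i)\right| = \sum_{i=1}^{n} \left|D_1(i) - D_2(i)\right| = 2 d_{\rm TV}(D_1, D_2).
\end{aligned}$$

(We remark that Eq. (2) could also be obtained directly by applying the data processing inequality
for total variation distance (Fact 2.1) to $D_1, D_2$, for the transformation $\Phi_\alpha(\cdot)$.) □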
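The contraction property of Eq. (2) is easy to check empirically. The sketch below is illustrative only: the helper names are hypothetical, the domain is 0-indexed rather than $[n]$, and `geometric_intervals` assumes the decomposition uses consecutive intervals whose $k$-th length is $\lfloor (1+\alpha)^k \rfloor$ (the definition of the decomposition is not restated in this appendix). The inequality itself only relies on flattening over a fixed partition, so it holds for whatever partition is supplied.

```python
import random


def geometric_intervals(n, alpha):
    """Partition {0, ..., n-1} into consecutive half-open intervals (start, end),
    the k-th having length floor((1 + alpha)^k), with the last one truncated at n.
    Assumed form of the decomposition; see the lead-in above."""
    intervals, start, k = [], 0, 1
    while start < n:
        length = max(1, int((1 + alpha) ** k))
        end = min(n, start + length)
        intervals.append((start, end))
        start, k = end, k + 1
    return intervals


def flatten(dist, intervals):
    """Replace dist by its average on each interval of the partition."""
    out = [0.0] * len(dist)
    for start, end in intervals:
        avg = sum(dist[start:end]) / (end - start)
        for i in range(start, end):
            out[i] = avg
    return out


def tv(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def random_distribution(n, rng):
    """Arbitrary distribution on n points (normalized uniform random weights)."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    n, alpha = 200, 0.3
    parts = geometric_intervals(n, alpha)
    d1, d2 = random_distribution(n, rng), random_distribution(n, rng)
    before = tv(d1, d2)
    after = tv(flatten(d1, parts), flatten(d2, parts))
    print(f"d_TV before flattening: {before:.4f}, after: {after:.4f}")
```

In every run the distance after flattening should be at most the distance before, up to floating-point error.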


Proof of Corollary 2.5. Let $D$ be $\varepsilon$-close to monotone, and $D'$ be a monotone distribution such
that $d_{\rm TV}(D, D') = \eta \le \varepsilon$. By Eq. (2), we have

$$d_{\rm TV}(\Phi_\alpha(D), \Phi_\alpha(D')) \le d_{\rm TV}(D, D') = \eta \qquad (14)$$

proving the last part of the claim (since $\Phi_\alpha(D')$ is easily seen to be monotone).

Now, by the triangle inequality,

$$\begin{aligned}
d_{\rm TV}(D, \Phi_\alpha(D)) &\le d_{\rm TV}(D, D') + d_{\rm TV}(D', \Phi_\alpha(D')) + d_{\rm TV}(\Phi_\alpha(D'), \Phi_\alpha(D)) \\
&\le \eta + \alpha + \eta \\
&\le 2\varepsilon + \alpha
\end{aligned}$$

where the last inequality uses the assumption on $D'$ and Theorem 2.4 applied to it. □